¹ School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China
Email: [email protected]
² Australian Institute for Machine Learning, The University of Adelaide
³ Lingang Laboratory, Shanghai, China
⁴ Shanghai United Imaging Intelligence Co. Ltd., Shanghai, China
⁵ Shanghai Clinical Research and Trial Center, Shanghai, China
⁶ Shanghai Ninth People’s Hospital, Shanghai Jiao Tong University, Shanghai, China
⁷ Shanghai Linkedcare Information Technology Co., Ltd., Shanghai, China

Cephalometric Landmark Detection across Ages with Prototypical Network

Han Wu*¹, Chong Wang*², Lanzhuju Mei¹˒³, Tong Yang⁷, Min Zhu⁶, Dinggang Shen¹˒⁴˒⁵, Zhiming Cui(🖂)¹
Abstract

Automated cephalometric landmark detection is crucial in real-world orthodontic diagnosis. Current studies mainly focus only on adult subjects, neglecting the clinically crucial scenario presented by adolescents, whose landmarks often exhibit significantly different appearances compared to adults. Hence, an open question arises about how to develop a unified and effective detection algorithm across various age groups, including adolescents and adults. In this paper, we propose CeLDA, the first work for Cephalometric Landmark Detection across Ages. Our method leverages a prototypical network for landmark detection by comparing image features with landmark prototypes. To tackle the appearance discrepancy of landmarks between age groups, we design new strategies for CeLDA to improve prototype alignment and obtain a holistic estimation of landmark prototypes from a large set of training images. Moreover, a novel prototype relation mining paradigm is introduced to exploit the anatomical relations between the landmark prototypes. Extensive experiments validate the superiority of CeLDA in detecting cephalometric landmarks on both adult and adolescent subjects. To our knowledge, this is the first effort toward developing a unified solution and dataset for cephalometric landmark detection across age groups. Our code and dataset will be made public on GitHub.

Keywords:
Cephalometric Landmark · Prototypical Network · Landmark Prototypes · Relation Mining · Prototype Alignment

1 Introduction

Automatic and accurate detection of cephalometric landmarks holds significant importance in clinical practice, particularly for orthodontic diagnosis and therapy planning [14]. With the remarkable achievements of deep learning [13, 21], there are many learning-based efforts made for detecting cephalometric landmarks, i.e., regressing landmarks with deep convolutional neural networks [7, 10], improving detection performance with two-stage networks [6, 8, 16, 26], and modelling landmark relationships with anatomical prior information [2, 9].

Despite their encouraging performance, these existing approaches are mostly dedicated to detecting cephalometric landmarks on adult subjects, who have clear skull bones and a regular tooth arrangement as shown in Fig. 1(a), ignoring the more challenging adolescent subjects, who often exhibit complicated morphological changes in anatomy due to the presence of unerupted and permanent teeth, as in Fig. 1(b,c,d). Such changes are prone to cause significant shifts of the cephalometric landmarks [17]. Very recently, Ceph-Net [23] targeted cephalometric landmark detection on adolescent cases, using an attention-based stacked regression network to progressively refine detection results. However, Ceph-Net considers only adolescent cases and does not cover the common adult cases. To date, developing a unified and effective cephalometric landmark detection algorithm across different age groups, including both adolescent and adult cases, remains unexplored. The main obstacle to such an algorithm is the landmark shift across age groups, which demands robust learning capabilities.

Figure 1: (a) An adult case, with regular anatomical structures and permanent teeth (orange arrow); (b,c,d) adolescent cases, with complicated anatomical changes due to unerupted teeth (blue arrow) and baby teeth (green arrow). These changes on adolescent cases cause significant landmark shifts. Here we show only two landmarks (red points) out of ten for better visualization.

In this paper, we propose CeLDA for age-inclusive cephalometric landmark detection. Specifically, our CeLDA relies on a prototypical network to realize landmark detection by comparing image features with landmark prototypes. To ensure robust prototypes against the landmark shifts from different age groups, we present new strategies for CeLDA to promote prototype alignment and obtain a holistic estimation of landmark prototypes from a large set of training samples. Furthermore, a novel prototype relation mining paradigm is introduced to leverage anatomical relations among landmarks. Extensive experimental results illustrate that our CeLDA outperforms existing state-of-the-art (SOTA) approaches in detecting cephalometric landmarks on adolescent subjects, adult subjects, and both. To summarise, our major contributions are: 1) the first prototype-based approach for age-inclusive cephalometric landmark detection, where the holistic prototypes are obtained to improve the learning robustness and predictive performance; 2) a novel prototype relation mining paradigm to take advantage of crucial anatomical relationships between landmarks; 3) a new comprehensive benchmark dataset for landmark detection that consists of cephalometric images from both adolescent and adult subjects.

Figure 2: An overview of the proposed CeLDA method for cephalometric landmark detection, based on a set of holistic landmark prototypes.

2 Methodology

Our dataset $\mathcal{D}$ comprises image-label pairs represented by $(\mathbf{x},\mathbf{y})$, where $\mathbf{x}\in\mathbb{R}^{H\times W}$ denotes a cephalometric image of size $H\times W$, and $\mathbf{y}\in\{0,1\}^{K\times H\times W}$ represents $K$ binary ground-truth landmark maps. Each of the $K$ landmark maps has exactly one annotated landmark point, i.e., $\sum\mathbf{y}_{k,:,:}=1$. Following existing approaches [2, 5, 25, 27], we transform the sparsely-distributed landmark maps into $K$ landmark heatmaps $\mathbf{H}_{k}=\text{Gaussian}(\mathbf{y}_{k})\in\mathbb{R}^{H\times W}$ for model training, using a Gaussian smoothing strategy as in [2, 29].
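As an illustration of the heatmap construction, the following NumPy sketch turns one binary landmark map into a Gaussian heatmap. The standard deviation `sigma`, the peak value of 1.0, and the helper name `gaussian_heatmap` are assumptions made for this example; the text above does not specify them.

```python
import numpy as np

def gaussian_heatmap(height, width, row, col, sigma=2.0):
    """Return a (height, width) heatmap with an unnormalised Gaussian bump
    (peak value 1.0) centred on the landmark at (row, col).
    sigma is a hypothetical choice, not a value taken from the paper."""
    rows = np.arange(height)[:, None]   # column vector of row indices
    cols = np.arange(width)[None, :]    # row vector of column indices
    sq_dist = (rows - row) ** 2 + (cols - col) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# One binary landmark map y_k with a single annotated point at (5, 9).
y_k = np.zeros((16, 16))
y_k[5, 9] = 1

# Recover the landmark coordinates from the binary map, then smooth.
row, col = np.argwhere(y_k == 1)[0]
H_k = gaussian_heatmap(16, 16, row, col, sigma=2.0)
```

The heatmap keeps its maximum at the annotated pixel, so the landmark location can still be read back with an argmax.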

2.1 Overview

An overview of our proposed method is shown in Fig. 2. For an input cephalometric image $\mathbf{x}$, we employ a network backbone $f_{\theta}$, i.e., U-Net [12], to extract multi-level high-resolution feature maps $\{\mathbf{F}_{1},\mathbf{F}_{2},\mathbf{F}_{3}\}$, where $\mathbf{F}_{1}\in\mathbb{R}^{\frac{H}{4}\times\frac{W}{4}\times D_{1}}$, $\mathbf{F}_{2}\in\mathbb{R}^{\frac{H}{2}\times\frac{W}{2}\times D_{2}}$, and $\mathbf{F}_{3}\in\mathbb{R}^{H\times W\times D_{3}}$. To enable accurate detection of the sparsely-distributed landmarks, these feature maps are up-sampled to the original resolution of the input image and then concatenated into a composite feature map $\mathbf{F}=\text{concat}\left(\text{up}(\mathbf{F}_{1}),\text{up}(\mathbf{F}_{2}),\text{up}(\mathbf{F}_{3})\right)\in\mathbb{R}^{H\times W\times D}$, where $D=D_{1}+D_{2}+D_{3}$ and $\text{up}(\cdot)$ denotes an up-sampling operation.
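The shape bookkeeping of the composite feature map can be checked with a small NumPy sketch. Nearest-neighbour up-sampling and the channel sizes below are assumptions chosen for illustration; the interpolation actually used by $\text{up}(\cdot)$ is not stated in the text.

```python
import numpy as np

def upsample_nearest(feature, factor):
    """Nearest-neighbour up-sampling of an (h, w, d) map by an integer factor.
    This is a stand-in for up(.); the paper does not name the interpolation."""
    return feature.repeat(factor, axis=0).repeat(factor, axis=1)

H, W = 8, 8
D1, D2, D3 = 2, 3, 4                     # hypothetical channel sizes

F1 = np.ones((H // 4, W // 4, D1))       # quarter resolution
F2 = np.ones((H // 2, W // 2, D2))       # half resolution
F3 = np.ones((H, W, D3))                 # full resolution

# Bring every level to H x W, then stack along the channel axis.
F = np.concatenate(
    [upsample_nearest(F1, 4), upsample_nearest(F2, 2), upsample_nearest(F3, 1)],
    axis=-1,
)
```

The result has $D_{1}+D_{2}+D_{3}$ channels at every pixel, which is the dimensionality that each landmark prototype must share.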

In Fig. 2(a), our CeLDA method leverages $K$ holistic prototypes $\mathcal{P}_{hol}=\{\mathbf{p}^{hol}_{k}\}_{k=1}^{K}$ that are estimated from a large set of training samples, as introduced in Section 2.2. In $\mathcal{P}_{hol}$, each prototype $\mathbf{p}^{hol}_{k}\in\mathbb{R}^{1\times 1\times D}$ corresponds to one landmark and captures robust landmark-representative features. After that, we derive $K$ similarity maps, see Fig. 2(b), by calculating the dot-product between the feature map $\mathbf{F}$ and each prototype $\mathbf{p}^{hol}_{k}$, which is formulated as:

$$\mathbf{S}_{k}=\mathbf{p}^{hol}_{k}\cdot\mathbf{F}\in\mathbb{R}^{H\times W}. \tag{1}$$

Finally, the detection prediction for the $k$-th landmark is obtained by selecting the location in $\mathbf{S}_{k}$ that has the highest similarity. For model training, the standard regression loss is utilized to supervise our CeLDA:

$$\mathcal{L}_{\text{reg}}=\frac{1}{K}\sum_{k=1}^{K}\|\mathbf{S}_{k}-\mathbf{H}_{k}\|_{2}^{2}, \tag{2}$$

where $\mathbf{H}_{k}$ is the $k$-th ground-truth landmark heatmap.
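Eqs. (1) and (2), together with the argmax prediction rule, can be sketched in a few lines of NumPy. The random features, prototypes, and heatmaps below are placeholders for illustration only, and the tensor sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D, K = 8, 8, 4, 3                  # hypothetical sizes

F = rng.normal(size=(H, W, D))           # composite feature map
prototypes = rng.normal(size=(K, D))     # stand-in for the holistic prototypes
heatmaps = rng.random(size=(K, H, W))    # stand-in for ground-truth heatmaps

# Eq. (1): S_k(i, j) is the dot product between p_k and the feature at (i, j).
S = np.einsum("kd,hwd->khw", prototypes, F)

# Prediction: the location with the highest similarity for each landmark.
flat_idx = S.reshape(K, -1).argmax(axis=1)
pred = np.stack(np.unravel_index(flat_idx, (H, W)), axis=1)   # (K, 2) as (row, col)

# Eq. (2): squared L2 distance per landmark, averaged over the K landmarks.
loss_reg = ((S - heatmaps) ** 2).sum(axis=(1, 2)).mean()
```

Because the prototypes act as $1\times 1$ filters over $\mathbf{F}$, the same operation could equally be written as a $1\times 1$ convolution with $K$ output channels.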

In the following section, we elaborate on how to estimate and obtain the prototypes $\mathcal{P}_{hol}$ for robust cephalometric landmark detection across age groups.

2.2 Holistic Estimation of Landmark Prototypes

Prototypes have long been studied [20] for classification [15] and segmentation [28], where their essence is to represent classes by prototypes that encode class-representative features. By analogy, in our landmark detection task it is natural to define prototypes that represent landmarks, i.e., that capture landmark-representative features. To achieve this, we propose to first create instance-level landmark prototypes $\mathcal{P}_{ins}=\{\mathbf{p}_{k}^{ins}\}_{k=1}^{K}$ for each individual training image $\mathbf{x}$, where $\mathbf{p}_{k}^{ins}\in\mathbb{R}^{1\times 1\times D}$ and each of them is calculated as:

$$\mathbf{p}_{k}^{ins}=\frac{\sum_{i,j}\mathbf{H}_{k}(i,j)\cdot\mathbf{F}(i,j)}{\sum_{i,j}\mathbf{H}_{k}(i,j)}, \tag{3}$$

where $i\in\{1,...,H\}$ and $j\in\{1,...,W\}$ are spatial indexes. From Eq. (3), the instance prototypes $\mathcal{P}_{ins}$ are generated by averaging the local contextual features around the landmark point, weighted by the heatmap $\mathbf{H}_{k}$. Although straightforward and easy to implement, one noticeable shortcoming of the instance-level prototypes is that they consider only individual-image information, which is insufficient to encapsulate the drastic appearance variations of the cephalometric landmarks, particularly across different age groups. To overcome this problem, we propose a new strategy to achieve a holistic estimation of the landmark prototypes. Specifically, inspired by the well-established exponential moving average (EMA) technique, we obtain the holistic prototypes $\mathcal{P}_{hol}$ in an on-the-fly fashion by exploiting a large set of training samples, as formulated below:

$$\mathcal{P}_{hol}^{(t+1)}=\alpha\cdot\mathcal{P}_{hol}^{(t)}+(1-\alpha)\cdot\frac{1}{|\mathcal{B}|}\sum_{b=1}^{|\mathcal{B}|}\mathcal{P}_{ins}^{(t),b}, \tag{4}$$

where $\mathcal{B}$ denotes a training mini-batch of size $|\mathcal{B}|$, $\alpha$ is a momentum update coefficient, $t$ indicates the training iteration, and $\mathcal{P}_{hol}$ denotes the holistic prototypes used for reliable detection of the landmarks, as described in Section 2.1. Note from Eq. (4) that, during training, the holistic prototypes evolve slowly, drawing on information from not only the current mini-batch but also the historical prototypes. They therefore gradually gain a global picture of the whole training set, allowing robust landmark detection from cephalometric images across ages, such as adolescent and adult stages.
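A minimal NumPy sketch of Eqs. (3) and (4) is given below. The zero initialisation of the holistic prototypes, the random inputs, and the helper names `instance_prototypes` and `ema_update` are assumptions for illustration; the text does not state how $\mathcal{P}_{hol}^{(0)}$ is initialised.

```python
import numpy as np

def instance_prototypes(F, heatmaps):
    """Eq. (3): heatmap-weighted average of the features.
    F: (H, W, D), heatmaps: (K, H, W) -> prototypes of shape (K, D)."""
    weighted = np.einsum("khw,hwd->kd", heatmaps, F)
    return weighted / heatmaps.sum(axis=(1, 2))[:, None]

def ema_update(p_hol, batch_p_ins, alpha=0.99):
    """Eq. (4): blend the previous holistic prototypes (K, D) with the
    mini-batch mean of the instance prototypes (B, K, D)."""
    return alpha * p_hol + (1.0 - alpha) * batch_p_ins.mean(axis=0)

rng = np.random.default_rng(0)
B, H, W, D, K = 8, 6, 6, 4, 3                      # hypothetical sizes
features = rng.normal(size=(B, H, W, D))
heatmaps = rng.random(size=(B, K, H, W)) + 1e-3    # strictly positive weights

batch_p_ins = np.stack(
    [instance_prototypes(features[b], heatmaps[b]) for b in range(B)]
)
p_hol = np.zeros((K, D))                           # assumed initialisation
p_hol_next = ema_update(p_hol, batch_p_ins, alpha=0.99)
```

With $\alpha=0.99$, each iteration moves the holistic prototypes only 1% of the way toward the current mini-batch mean, which is what makes them change slowly.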

2.3 Cross-image Prototype Alignment

According to Eq. (4), our holistic prototypes are obtained by accumulating instance-level prototypes during training. To increase the prototype robustness, we also propose to encourage prototype alignment across individual images:

$$\mathcal{L}_{\text{align}}=\frac{1}{K}\sum_{k=1}^{K}\|\mathbf{p}_{m,k}^{ins}-\mathbf{p}_{n,k}^{ins}\|_{2}^{2}, \tag{5}$$

where $\mathbf{p}_{m,k}^{ins}$ and $\mathbf{p}_{n,k}^{ins}$ denote the $k$-th instance-level prototypes for the images $\mathbf{x}_{m}$ and $\mathbf{x}_{n}$ within the mini-batch $\mathcal{B}$, respectively. Notice that Eq. (5) enforces prototype consistency for training samples not only within the same age group but also across different age groups.
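For a single image pair, Eq. (5) reduces to the following NumPy sketch. How the pairs $(m, n)$ are drawn from the mini-batch is not specified in the text, so the example simply evaluates one pair with made-up prototype values.

```python
import numpy as np

def align_loss(p_m, p_n):
    """Eq. (5): squared L2 distance between matching prototypes of two images,
    averaged over the K landmarks. p_m, p_n: arrays of shape (K, D)."""
    return float(((p_m - p_n) ** 2).sum(axis=1).mean())

# Two landmarks (K = 2) with two-dimensional prototypes (D = 2), toy values.
p_m = np.array([[1.0, 0.0], [0.0, 2.0]])
p_n = np.array([[0.0, 0.0], [0.0, 0.0]])

loss = align_loss(p_m, p_n)   # (1 + 4) / 2 = 2.5
```

The loss is zero only when the two images produce identical prototypes for every landmark, regardless of which age group each image comes from.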

2.4 Masked Prototype Relation Mining

As illustrated in Fig. 1, landmarks naturally have crucial anatomical relations within a cephalometric image [24]. Given that our CeLDA harnesses prototypes to represent landmarks, we further present a novel prototype relation mining paradigm to exploit the anatomical dependency between landmarks.

Motivated by the great success of masked modeling in language [3] and vision [4] applications, in this paper, we propose to mask the instance-level landmark prototypes. As demonstrated in Fig. 2(c), after obtaining $K$ instance prototypes $\mathcal{P}_{ins}$ from a training image, we randomly mask out a proportion of prototypes in $\mathcal{P}_{ins}$ and replace them with zero, where the landmark positional embeddings are introduced as a location indicator. The combination of masked prototypes and positional embeddings is processed by a multi-head self-attention (MSA) layer, which reconstructs the masked prototypes as follows:

$$\hat{\mathcal{P}}_{ins}=\text{MSA}\left(\text{mask}(\mathcal{P}_{ins})\oplus\mathbf{E}_{pos}\right),\quad\mathbf{E}_{pos}=\text{MLP}(\bar{\mathbf{y}}), \tag{6}$$

where $\hat{\mathcal{P}}_{ins}$ denotes the reconstructed prototypes, $\text{mask}(\cdot)$ is a mask-out operation that randomly excludes a ratio (denoted by $R$) of prototypes from $\mathcal{P}_{ins}$, $\oplus$ represents the element-wise summation, and $\mathbf{E}_{pos}$ denotes the landmark positional embeddings, encoding the ground-truth landmark coordinates $\bar{\mathbf{y}}\in\mathbb{R}^{K\times 2}$ using a multi-layer perceptron (MLP), where the landmark coordinates $\bar{\mathbf{y}}$ can be easily derived from the ground-truth landmark maps $\mathbf{y}$. The reconstructed prototypes are supervised by the original prototypes in $\mathcal{P}_{ins}$:

$$\mathcal{L}_{\text{mine}}=\sum_{k=1}^{K}\|\hat{\mathbf{p}}_{k}^{ins}-\mathbf{p}_{k}^{ins}\|_{2}^{2}, \tag{7}$$

where $\hat{\mathbf{p}}_{k}^{ins}$ denotes a reconstructed prototype in $\hat{\mathcal{P}}_{ins}$, and $\mathbf{p}_{k}^{ins}$ is the corresponding raw prototype in $\mathcal{P}_{ins}$. Relying on Eq. (7), our CeLDA can make full use of the structural information regarding landmark relations during the process of learning the instance-level prototypes for each training sample, benefiting its understanding of the anatomical landmark dependency.
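The masking step and the reconstruction loss of Eq. (7) can be sketched as follows. The MSA layer and the positional-embedding MLP of Eq. (6) are deliberately left out; the rounding rule for the number of masked prototypes and the helper names are assumptions made for this example.

```python
import numpy as np

def mask_prototypes(p_ins, ratio, rng):
    """Zero out round(ratio * K) randomly chosen prototypes.
    Returns the masked copy and a boolean vector marking the masked rows.
    The rounding rule is an assumption, not taken from the paper."""
    K = p_ins.shape[0]
    n_mask = int(round(ratio * K))
    masked_idx = rng.choice(K, size=n_mask, replace=False)
    is_masked = np.zeros(K, dtype=bool)
    is_masked[masked_idx] = True
    masked = p_ins.copy()
    masked[is_masked] = 0.0
    return masked, is_masked

def mine_loss(p_rec, p_ins):
    """Eq. (7): squared L2 distance summed over the K prototypes."""
    return float(((p_rec - p_ins) ** 2).sum())

rng = np.random.default_rng(0)
K, D = 10, 4                                  # ten landmarks, hypothetical D
p_ins = rng.normal(size=(K, D))
masked, is_masked = mask_prototypes(p_ins, ratio=0.7, rng=rng)

# Loss if the masked input were returned unchanged, i.e. with no reconstruction.
loss_no_reconstruction = mine_loss(masked, p_ins)
```

With $K=10$ and $R=0.7$, seven prototypes are zeroed, so a reconstruction module has to infer them from the three visible prototypes and the positional embeddings in order to reduce this loss.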

2.5 Overall Training Objective

The overall optimization objective of CeLDA is defined as:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{reg}}+\lambda_{1}\mathcal{L}_{\text{align}}+\lambda_{2}\mathcal{L}_{\text{mine}}, \tag{8}$$

where $\lambda_{1}$ and $\lambda_{2}$ are hyper-parameters that control the weights of the loss terms.

3 Experiments

3.1 Dataset and Evaluation Metric

3.1.1 Dataset:

For the task of cephalometric landmark detection across age groups, we collected a new benchmark dataset, named CephAdoAdu, with both adolescent and adult cases, distinguishing it from existing datasets that solely consist of either adolescent or adult cases. CephAdoAdu has a total of 1000 (500 adult cases, 500 adolescent cases) cephalometric X-ray images, acquired from eight clinical centers. Every cephalometric image underwent manual annotations to mark 10 typical landmarks, by an experienced dental radiologist with over ten years of expertise. Our new dataset has two advantages over existing ones: 1) a more clinically practical coverage of subjects across different age groups; 2) a larger number of annotated images, ensuring a comprehensive and faithful model evaluation. The whole dataset is randomly divided into training set (400 images), validation set (300 images), and testing set (300 images). Notice that our data split is evenly performed in terms of the adult and adolescent cases.

3.1.2 Evaluation Metric:

Following previous studies [18, 19], we evaluate the model performance with the two commonly-used metrics: 1) Mean Radial Error (MRE) computes the average Euclidean distance between the predicted and ground-truth landmarks; 2) Successful Detection Rate (SDR) is defined as the percentage of landmarks that are accurately detected within a range of 2.0 mm, 2.5 mm, 3.0 mm, and 4 mm from the ground-truth landmarks.
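Both metrics can be computed as in the sketch below. The pixel spacing `mm_per_pixel`, the inclusive threshold (`<=`), and the toy coordinates are assumptions for illustration; the text does not give the physical resolution of the images.

```python
import math

def radial_errors(pred, gt, mm_per_pixel):
    """Euclidean distance in mm between predicted and ground-truth points."""
    return [
        math.hypot(px - gx, py - gy) * mm_per_pixel
        for (px, py), (gx, gy) in zip(pred, gt)
    ]

def mean_radial_error(errors):
    """MRE: average radial error over all landmarks."""
    return sum(errors) / len(errors)

def successful_detection_rate(errors, threshold_mm):
    """SDR: percentage of landmarks whose error is within the threshold."""
    return 100.0 * sum(e <= threshold_mm for e in errors) / len(errors)

# Four toy landmarks in pixel coordinates (hypothetical values).
pred = [(10, 10), (20, 24), (30, 30), (43, 44)]
gt = [(10, 10), (20, 20), (33, 34), (40, 40)]

errors = radial_errors(pred, gt, mm_per_pixel=0.5)   # [0.0, 2.0, 2.5, 2.5]
mre = mean_radial_error(errors)                      # 1.75
sdr = {t: successful_detection_rate(errors, t) for t in (2.0, 2.5, 3.0, 4.0)}
```

In this toy case the SDR is 50% at 2.0 mm and 100% from 2.5 mm upward, which shows how the four thresholds expose different parts of the error distribution.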

3.2 Implementation Details

Our CeLDA employs U-Net [12] as the network backbone $f_{\theta}$. In Eq. (4), $\alpha=0.99$ and the mini-batch size $|\mathcal{B}|=8$. All images are resized to $512\times 512$ as model input. Training images are augmented with random changes in brightness, contrast, and Gaussian noise. Our CeLDA is optimized for a total of 150 training epochs, using the SGD optimizer with a learning rate of 0.001, which is decreased by a factor of 0.1 every 50 epochs. $\lambda_{1}$ and $\lambda_{2}$ in Eq. (8) are set to 1.0 and 3.0, respectively. We set the mask ratio $R=0.7$ for the prototype relation mining. All methods were implemented in the PyTorch framework and trained on an NVIDIA Tesla A100 GPU with 40 GB memory.
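The step-wise learning-rate decay described above can be written out explicitly. The sketch assumes zero-indexed epochs and that the decay is applied exactly at epochs 50 and 100; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, gamma=0.1, step=50):
    """Learning rate at a given zero-indexed epoch: the base rate is
    multiplied by gamma once for every completed block of `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Rates at the start, just before and after each decay, and at the last epoch.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 49, 50, 100, 149)}
```

Under these assumptions the rate is 1e-3 for epochs 0 to 49, 1e-4 for epochs 50 to 99, and 1e-5 for epochs 100 to 149. In PyTorch, the same behaviour is available through `torch.optim.lr_scheduler.StepLR` with `step_size=50` and `gamma=0.1`, although the text does not say which scheduler was used.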

3.3 Comparison with SOTA Approaches

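Table 1: Cephalometric landmark detection results with both adult and adolescent cases, only adult cases, and only adolescent cases, respectively.

Adult + Adolescent
| Methods | MRE ↓ (mm, std.) | SDR 2mm ↑ (%) | SDR 2.5mm ↑ (%) | SDR 3mm ↑ (%) | SDR 4mm ↑ (%) |
|---|---|---|---|---|---|
| Cascade RCNN [1] | 2.31 (0.94) | 61.47 | 73.20 | 81.13 | 90.77 |
| SCN [11] | 1.73 (1.06) | 82.97 | 90.40 | 93.37 | 96.57 |
| GU2Net [29] | 1.69 (0.91) | 80.33 | 88.13 | 91.47 | 95.57 |
| Wu et al. [22] | 1.34 (1.24) | 87.17 | 91.93 | 95.57 | 97.10 |
| CeLDA | 1.05 (0.33) | 89.13 | 93.60 | 96.17 | 98.67 |

Adult
| Methods | MRE ↓ (mm, std.) | SDR 2mm ↑ (%) | SDR 2.5mm ↑ (%) | SDR 3mm ↑ (%) | SDR 4mm ↑ (%) |
|---|---|---|---|---|---|
| Cascade RCNN [1] | 2.19 (0.97) | 59.93 | 72.13 | 80.47 | 90.80 |
| SCN [11] | 1.40 (0.48) | 82.07 | 91.20 | 94.33 | 97.33 |
| GU2Net [29] | 1.46 (0.50) | 80.27 | 88.80 | 92.07 | 96.33 |
| Wu et al. [22] | 1.13 (0.66) | 86.60 | 92.13 | 95.00 | 97.80 |
| CeLDA | 1.10 (0.37) | 88.33 | 92.93 | 96.20 | 98.80 |

Adolescent
| Methods | MRE ↓ (mm, std.) | SDR 2mm ↑ (%) | SDR 2.5mm ↑ (%) | SDR 3mm ↑ (%) | SDR 4mm ↑ (%) |
|---|---|---|---|---|---|
| Cascade RCNN [1] | 2.43 (0.94) | 63.00 | 74.27 | 81.80 | 90.73 |
| SCN [11] | 2.05 (1.70) | 83.87 | 89.60 | 92.40 | 95.80 |
| GU2Net [29] | 1.93 (1.35) | 80.40 | 87.47 | 90.87 | 94.80 |
| Wu et al. [22] | 1.55 (1.87) | 87.73 | 91.73 | 94.13 | 96.40 |
| CeLDA | 1.00 (0.34) | 89.93 | 94.27 | 96.13 | 98.53 |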

We compare our CeLDA with the following typical landmark detection models. Cascade RCNN [1] detects cephalometric landmarks with a multi-stage object detection architecture to progressively eliminate noisy predictions. SCN [11] regresses landmark heatmaps with a fully convolutional network that considers the spatial configuration of landmarks. GU2Net [29] is a universal landmark detection method that solves multiple detection tasks with end-to-end training on mixed datasets. We also compare with the recent champion method, proposed by Wu et al. [22], in the MICCAI CL-Detection2023 leaderboard. It is worth noting that all the above approaches are designed for detecting landmarks from only adult images. To achieve comparison fairness, all these competing approaches employ the same image augmentation strategies mentioned in Section 3.2.

Table 1 provides landmark detection results on the CephAdoAdu test set. As evident, our CeLDA consistently outperforms the competing approaches on both adult and adolescent cases, only adult cases, and only adolescent cases. In particular, relative to the strongest competitor, Wu et al. [22], CeLDA improves more on adolescent cases (MRE from 1.55 mm to 1.00 mm) than on adult cases (MRE from 1.13 mm to 1.10 mm), showing its strength in detecting the more challenging adolescent cephalometric landmarks. Moreover, CeLDA also clearly surpasses the other approaches on the combined adult and adolescent cases (MRE of 1.05 mm versus 1.34 mm for the second-best method), verifying its effectiveness in detecting landmarks across age groups.

3.4 Analytical Ablation Studies

Table 2: Ablation analysis for our proposed CeLDA method.

Adult + Adolescent
| $\mathcal{L}_{\text{reg}}$ | $\mathcal{L}_{\text{align}}$ | $\mathcal{L}_{\text{mine}}$ | MRE ↓ (mm, std.) | SDR 2mm ↑ (%) | SDR 2.5mm ↑ (%) | SDR 3mm ↑ (%) | SDR 4mm ↑ (%) |
|---|---|---|---|---|---|---|---|
| ✓ | | | 1.16 (0.38) | 86.77 | 92.00 | 95.30 | 98.10 |
| ✓ | ✓ | | 1.13 (0.34) | 87.83 | 92.97 | 95.70 | 98.30 |
| ✓ | ✓ | ✓ | 1.05 (0.33) | 89.13 | 93.60 | 96.17 | 98.67 |

Adult
| $\mathcal{L}_{\text{reg}}$ | $\mathcal{L}_{\text{align}}$ | $\mathcal{L}_{\text{mine}}$ | MRE ↓ (mm, std.) | SDR 2mm ↑ (%) | SDR 2.5mm ↑ (%) | SDR 3mm ↑ (%) | SDR 4mm ↑ (%) |
|---|---|---|---|---|---|---|---|
| ✓ | | | 1.16 (0.36) | 85.93 | 91.73 | 95.60 | 98.53 |
| ✓ | ✓ | | 1.14 (0.36) | 86.20 | 92.47 | 96.07 | 98.87 |
| ✓ | ✓ | ✓ | 1.10 (0.37) | 88.33 | 92.93 | 96.20 | 98.80 |

Adolescent
| $\mathcal{L}_{\text{reg}}$ | $\mathcal{L}_{\text{align}}$ | $\mathcal{L}_{\text{mine}}$ | MRE ↓ (mm, std.) | SDR 2mm ↑ (%) | SDR 2.5mm ↑ (%) | SDR 3mm ↑ (%) | SDR 4mm ↑ (%) |
|---|---|---|---|---|---|---|---|
| ✓ | | | 1.17 (0.42) | 87.60 | 92.27 | 95.00 | 97.67 |
| ✓ | ✓ | | 1.12 (0.44) | 89.47 | 93.47 | 95.33 | 97.73 |
| ✓ | ✓ | ✓ | 1.00 (0.34) | 89.93 | 94.27 | 96.13 | 98.53 |

We perform ablation experiments to study the effectiveness of our prototype alignment alignsubscriptalign\mathcal{L}_{\text{align}}caligraphic_L start_POSTSUBSCRIPT align end_POSTSUBSCRIPT and prototype relation mining minesubscriptmine\mathcal{L}_{\text{mine}}caligraphic_L start_POSTSUBSCRIPT mine end_POSTSUBSCRIPT, with results given in Table 2. We observe that the baseline (using only regsubscriptreg\mathcal{L}_{\text{reg}}caligraphic_L start_POSTSUBSCRIPT reg end_POSTSUBSCRIPT) achieves an MRE of 1.16 mm on adult and adolescent cases, which reduces to 1.13 mm, upon the use of prototype alignment loss. Remarkable performance improvements can be observed when further utilizing the prototype relation mining strategy, showing the advantage of mining prototype relations to harness the anatomical dependency between landmarks. We show typical visual landmark detection results in Fig. LABEL:fig:ablation (a), where we observe progressive improvements with the incorporation of each key component in our method.

In Fig. LABEL:fig:ablation (b), we explore the sensitivity of CeLDA to the mask ratio $R$ used for landmark prototype relation mining. A small mask ratio is inadequate to mine the prototype relations, resulting in sub-optimal results. Conversely, a large mask ratio may lead CeLDA to reconstruct wrong landmark prototypes, causing a decline in predictive performance. Based on Fig. LABEL:fig:ablation (b), we set the mask ratio to 0.7 in all other experiments.
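To make the role of the mask ratio concrete, here is a minimal sketch of the masking step only, assuming that a fraction $R$ of the landmark prototypes is chosen uniformly at random and replaced by a shared mask token before a separate model tries to reconstruct them. The function `mask_prototypes`, the mask token, and the toy prototype vectors are hypothetical; the reconstruction network and the actual sampling scheme used by CeLDA are not shown and may differ.

```python
import random


def mask_prototypes(prototypes, mask_ratio, mask_token, rng=None):
    """Replace a random subset of prototypes with a mask token.

    prototypes: list of K feature vectors (lists of floats), one per landmark.
    mask_ratio: fraction R in [0, 1] of prototypes to mask.
    mask_token: vector substituted for every masked prototype (assumed shared).
    Returns the masked copy and the sorted indices that were masked.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = rng or random.Random()

    k = len(prototypes)
    n_mask = int(round(mask_ratio * k))  # number of prototypes to hide

    # Sample distinct landmark indices without replacement.
    masked_idx = sorted(rng.sample(range(k), n_mask))
    masked_set = set(masked_idx)

    # Copy every vector so the caller's prototypes are left untouched.
    masked = [
        list(mask_token) if i in masked_set else list(p)
        for i, p in enumerate(prototypes)
    ]
    return masked, masked_idx


if __name__ == "__main__":
    # Ten toy 2-D prototypes; real prototypes would be learned feature vectors.
    protos = [[float(i), float(i) + 0.5] for i in range(10)]
    masked, idx = mask_prototypes(protos, 0.7, [-1.0, -1.0], rng=random.Random(0))
    print("masked indices:", idx)
    print("visible prototypes:", [p for i, p in enumerate(masked) if i not in idx])
```

With ten prototypes and $R = 0.7$, seven are hidden and only three remain visible as context. This illustrates the trade-off described above: a higher ratio leaves fewer visible prototypes from which the hidden ones can be inferred, while a lower ratio makes the reconstruction task easier and less informative.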

4 Conclusion

In this work, we presented CeLDA, a method that addresses cephalometric landmark detection across different age groups with a prototypical network. CeLDA detects cephalometric landmarks by comparing image features with a set of holistic landmark prototypes, whose anatomical relations are exploited with a masking-based mining strategy. CeLDA outperforms existing approaches on both adolescent and adult cases. We established and released the first cephalometric benchmark dataset covering a large number of both adult and adolescent cases, with the hope that it will provide a more comprehensive evaluation for the landmark detection community.

References

  • [1] Cai, Z., Vasconcelos, N.: Cascade r-cnn: Delving into high quality object detection. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 6154–6162 (2018)
  • [2] Chen, R., Ma, Y., Liu, L., Chen, N., Cui, Z., Wei, G., Wang, W.: Semi-supervised anatomical landmark detection via shape-regulated self-training. Neurocomputing 471, 335–345 (2022)
  • [3] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [4] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16000–16009 (2022)
  • [5] He, T., Yao, J., Tian, W., Yi, Z., Tang, W., Guo, J.: Cephalometric landmark detection by considering translational invariance in the two-stage framework. Neurocomputing 464, 15–26 (2021)
  • [6] Jiang, Y., Li, Y., Wang, X., Tao, Y., Lin, J., Lin, H.: Cephalformer: Incorporating global structure constraint into visual features for general cephalometric landmark detection. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 227–237. Springer (2022)
  • [7] Lee, H., Park, M., Kim, J.: Cephalometric landmark detection in dental x-ray images using convolutional neural networks. In: Medical imaging 2017: Computer-aided diagnosis. vol. 10134, pp. 494–499. SPIE (2017)
  • [8] Lee, J.H., Yu, H.J., Kim, M.j., Kim, J.W., Choi, J.: Automated cephalometric landmark detection with confidence regions using bayesian convolutional neural networks. BMC Oral Health 20, 1–10 (2020)
  • [9] Li, W., Lu, Y., Zheng, K., Liao, H., Lin, C., Luo, J., Cheng, C.T., Xiao, J., Lu, L., Kuo, C.F., et al.: Structured landmark detection via topology-adapting deep graph learning. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. pp. 266–283. Springer (2020)
  • [10] Oh, K., Oh, I.S., Lee, D.W., et al.: Deep anatomical context feature learning for cephalometric landmark detection. IEEE Journal of Biomedical and Health Informatics 25(3), 806–817 (2020)
  • [11] Payer, C., Štern, D., Bischof, H., Urschler, M.: Integrating spatial configuration into heatmap regression based cnns for landmark localization. Medical Image Analysis 54, 207–219 (2019)
  • [12] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. pp. 234–241. Springer (2015)
  • [13] Schmidhuber, J.: Deep learning in neural networks: An overview. Neural Networks 61, 85–117 (2015)
  • [14] Schwendicke, F., Chaurasia, A., Arsiwala, L., Lee, J.H., Elhennawy, K., Jost-Brinkmann, P.G., Demarco, F., Krois, J.: Deep learning for cephalometric landmark detection: systematic review and meta-analysis. Clinical Oral Investigations 25(7), 4299–4309 (2021)
  • [15] Snell, J., Swersky, K., Zemel, R.: Prototypical networks for few-shot learning. Advances in Neural Information Processing Systems 30 (2017)
  • [16] Song, Y., Qiao, X., Iwamoto, Y., Chen, Y.w.: Automatic cephalometric landmark detection on x-ray images using a deep-learning method. Applied Sciences 10(7), 2547 (2020)
  • [17] Tanikawa, C., Yamamoto, T., Yagi, M., Takada, K.: Automatic recognition of anatomic features on cephalograms of preadolescent children. The Angle Orthodontist 80(5), 812–820 (2010)
  • [18] Wang, C.W., Huang, C.T., Hsieh, M.C., Li, C.H., Chang, S.W., Li, W.C., Vandaele, R., Marée, R., Jodogne, S., Geurts, P., et al.: Evaluation and comparison of anatomical landmark detection methods for cephalometric x-ray images: a grand challenge. IEEE Transactions on Medical Imaging 34(9), 1890–1900 (2015)
  • [19] Wang, C.W., Huang, C.T., Lee, J.H., Li, C.H., Chang, S.W., Siao, M.J., Lai, T.M., Ibragimov, B., Vrtovec, T., Ronneberger, O., et al.: A benchmark for comparison of dental radiography analysis algorithms. Medical Image Analysis 31, 63–76 (2016)
  • [20] Wang, C., Chen, Y., Liu, F., Elliott, M., Kwok, C.F., Peña-Solorzano, C., Frazer, H., McCarthy, D.J., Carneiro, G.: An interpretable and accurate deep-learning diagnosis framework modelled with fully and semi-supervised reciprocal learning. IEEE Transactions on Medical Imaging (2023)
  • [21] Wang, C., Cui, Z., Yang, J., Han, M., Carneiro, G., Shen, D.: Bowelnet: Joint semantic-geometric ensemble learning for bowel segmentation from both partially and fully labeled ct images. IEEE Transactions on Medical Imaging 42(4), 1225–1236 (2022)
  • [22] Wu, Q., Yeo, S.Y., Chen, Y., Liu, J.: Revisiting cephalometric landmark detection from the view of human pose estimation with lightweight super-resolution head. arXiv preprint arXiv:2309.17143 (2023)
  • [23] Yang, S., Song, E.S., Lee, E.S., Kang, S.R., Yi, W.J., Lee, S.P.: Ceph-net: automatic detection of cephalometric landmarks on scanned lateral cephalograms from children and adolescents using an attention-based stacked regression network. BMC Oral Health 23(1), 803 (2023)
  • [24] Yao, Q., Quan, Q., Xiao, L., Kevin Zhou, S.: One-shot medical landmark detection. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24. pp. 177–188. Springer (2021)
  • [25] Yueyuan, A., Hong, W.: Swin transformer combined with convolutional encoder for cephalometric landmarks detection. In: 2021 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP). pp. 184–187. IEEE (2021)
  • [26] Zeng, M., Yan, Z., Liu, S., Zhou, Y., Qiu, L.: Cascaded convolutional networks for automatic cephalometric landmark detection. Medical Image Analysis 68, 101904 (2021)
  • [27] Zhong, Z., Li, J., Zhang, Z., Jiao, Z., Gao, X.: An attention-guided deep regression model for landmark detection in cephalograms. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2019: 22nd International Conference, Shenzhen, China, October 13–17, 2019, Proceedings, Part VI 22. pp. 540–548. Springer (2019)
  • [28] Zhou, T., Wang, W., Konukoglu, E., Van Gool, L.: Rethinking semantic segmentation: A prototype view. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2582–2593 (2022)
  • [29] Zhu, H., Yao, Q., Xiao, L., Zhou, S.K.: You only learn once: Universal anatomical landmark detection. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part V 24. pp. 85–95. Springer (2021)