Learning to utilize image second-order derivative information for crisp edge detection

Changsong Liu1, Wei Zhang1, Yanyan Liu2, Yimeng Fan1, Mingyang Li1, and Wenlin Li1
* Yanyan Liu is the corresponding author. 1School of Microelectronics, Tian** University, Tian**, 300072, China
2College of Electronic Information and Optical Engineering, Nankai University, Tian**, 300072, China
Abstract

Edge detection is a fundamental task in computer vision. It has made great progress with the development of deep convolutional neural networks (DCNNs), some of which have achieved beyond-human-level performance. However, recent top-performing edge detection methods tend to generate thick and noisy edge lines. In this work, we address this problem from two aspects: (1) the lack of prior knowledge regarding image edges, and (2) the issue of imbalanced pixel distribution. We propose a second-order derivative-based multi-scale contextual enhancement module (SDMCM) to help the model locate true edge pixels accurately by introducing edge prior knowledge. We also construct a hybrid focal loss function (HFL) to alleviate the imbalanced distribution issue. In addition, we employ conditionally parameterized convolution (CondConv) to develop a novel boundary refinement module (BRM), which further refines the final output edge maps. Finally, we propose a U-shape network named LUS-Net, which is based on the SDMCM and BRM, for crisp edge detection. We perform extensive experiments on three standard benchmarks, and the experimental results show that our method predicts crisp and clean edge maps and achieves state-of-the-art performance on the BSDS500 dataset (ODS=0.829), NYUD-V2 dataset (ODS=0.768), and BIPED dataset (ODS=0.903).

Index Terms:
Edge detection, Convolutional neural network, Second-order derivative information, Hybrid focal loss function

I Introduction

Edge detection is a fundamental low-level problem in computer vision and image processing, playing a crucial role in a wide range of applications, from autonomous driving to medical imaging. By identifying the boundaries and structures within images, it serves as the cornerstone for many high-level computer vision tasks, such as salient object detection [1, 2, 3], object recognition [4, 5], and image segmentation [6, 7, 8]. With the rapid development of deep learning techniques, numerous researchers aim to employ deep convolutional neural networks (DCNNs) for edge detection. Recent studies have proposed several remarkable DCNN-based methods, such as HED [9], RCF [10], and CHRNet [11]. These DCNN-based approaches have achieved state-of-the-art performance on a number of benchmarks including BSDS500 [12] and NYUD-V2 [13], which demonstrates the strong power of DCNNs.

Figure 1: The crispness of edge maps from HED, Laplacian, and our method. (a) is an image from the BSDS500 dataset. (b) is the prediction of HED, which is thick and blurred. (c) is the output of Laplacian, which can leverage the second-order derivative information to generate crisp boundaries but with more noise. (d) is the output from our method, which is crisp and clean.

However, a major issue remains to be addressed in edge detection: most modern DCNN-based models [9, 10, 14] are inclined to generate thick and noisy edge maps. Such predictions introduce inaccuracies that compromise the spatial precision of high-level tasks, and a post-processing operation (NMS and morphological thinning) is frequently necessary to obtain clean and precise edge maps. This issue can be decomposed into two subsidiary problems: (1) excessive reliance on DCNN-based autonomous learning, without sufficient prior knowledge of edge information, which is essential for precise edge detection; and (2) a highly imbalanced pixel distribution between edges and non-edges in an image, which presents difficulties in pixel classification.

In this work, we attempt to tackle this central issue by addressing the two subsidiary problems. Our goal is to make the network generate one-pixel-wide edge lines without using the post-processing operation. On the one hand, most DCNN-based methods [9, 10, 14, 15, 16] leverage the powerful feature extraction capabilities of deep learning for edge detection. However, these methods neglect prior knowledge regarding image edges, which are typically determined by abrupt changes in pixel values. The image second-order derivative captures such changes and precisely locates the positions of edge pixels through its zero-crossing points. Therefore, leveraging this characteristic can help DCNNs locate the true edge pixels more accurately. However, merely introducing second-order derivative information can result in noisy edge maps, as illustrated in Fig. 1 (c). In response, we combine the second-order derivative information with multi-scale contextual information, developing a second-order derivative-based multi-scale contextual enhancement module (SDMCM). The SDMCM is based on the traditional Laplacian operator [17], which provides image second-order derivative cues, and on dilated convolution with a large receptive field, which captures long-range multi-scale contextual information that can filter out the noise. The proposed SDMCM empowers the network to attain higher precision in true edge pixel localization, consequently generating crisp and clean edge maps.

On the other hand, most DCNN-based approaches employ a weighted cross-entropy loss function [9, 10, 14, 18] to solve the issue of imbalanced pixel distribution, but such a strategy struggles to distinguish between true edge and false edge pixels, leading to the misclassification of false positive (FP) pixels as true positive (TP) pixels, which in turn produces thick and noisy predictions. Therefore, we propose a novel hybrid focal loss (HFL) to control FPs and FNs, which can effectively reduce the impact of imbalanced pixel distribution. The HFL is based on the Tversky index [19] and focal loss [20], and it constrains the network with both image-level and pixel-level information, enhancing the accuracy of edge and non-edge pixel classification.

Furthermore, we employ the conditionally parameterized convolution (CondConv) [21] to construct a boundary refinement module (BRM). CondConv can dynamically adjust convolutional kernel parameters according to the contextual information of input feature maps, and the BRM exploits this characteristic to improve the model's adaptability to different image scenarios, thereby further suppressing interference and refining the predicted edge maps. Finally, we propose LUS-Net, a network based on the U-shape architecture for crisp edge detection. The whole network can be split into three parts: encoder, skip-connections, and decoder. The encoder consists of a lightweight pre-trained model [22, 23, 24] to enable efficient prediction. The SDMCM serves as the skip-connection component. The decoder employs a dense connection to cascade each BRM, allowing the model to learn more diverse and rich feature representations. In addition, we conduct a series of experiments to demonstrate the effectiveness of our method; an example is shown in Fig. 1 (d).

In summary, the main contributions of our work are as follows:

  1. We build a second-order derivative-based multi-scale contextual enhancement module (SDMCM), which can help our model locate the true edge pixels more accurately, consequently generating crisp and clean edge maps.

  2. We construct a boundary refinement module (BRM) to further refine the edge maps, and propose a U-shape network named LUS-Net, which is based on the SDMCM and BRM, for crisp edge detection.

  3. We propose a novel hybrid focal loss (HFL) based on the Tversky index and focal loss, which can effectively alleviate the issue of imbalanced pixel distribution, resulting in suppressing misclassified false positive pixels near the true positive pixels.

  4. We conduct extensive experiments to demonstrate the advantages of our method and the results show that our method achieves the SOTA performance on three benchmark datasets.

The paper is structured as follows. Section II presents a review of related work in edge detection. Section III provides an elaboration of the LUS-Net, which involves SDMCM, BRM, and HFL. Section IV presents an analysis of our experiments. We evaluate the crispness of edges, provide detailed descriptions of the ablation study, analyze the function of each component, and compare our approach with recent state-of-the-art algorithms in edge detection. The final section, Section V, summarizes our proposed method and discusses future directions.

II Related work

Edge detection research spans over four decades, yielding extensive literature. In this section, we review some representative works which are split into two groups: traditional and deep learning-based methods.

Traditional methods: Early edge detection methods typically calculate image derivatives to produce edge maps. The Roberts operator [25] is a simple first-order derivative operator that detects edges by computing differences between diagonally adjacent pixels. The Sobel operator [26] is another first-order edge detector that uses two $3\times 3$ kernels to calculate image gradients in horizontal and vertical directions. The Canny [27] detector is a robust algorithm that detects edges by reducing noise, calculating gradients, and applying non-maximum suppression and hysteresis thresholding. The Laplacian detector [17] locates edge pixels by computing the second-order derivative of the image intensity, highlighting regions of rapid intensity change. Over time, researchers have improved edge detection by integrating texture, gradient, and other low-level features. Methods like Pb [28], gPb [12], and SE [29] use a classifier to generate object-level boundaries by utilizing these features. Despite their enhanced performance over derivative-based methods, they still rely on human-designed features and lack semantic information, limiting further improvement.

Deep learning-based methods: Deep learning-based techniques have significantly advanced the area of edge detection and play a crucial role in most state-of-the-art (SOTA) edge detection methods. SOTA edge detectors in recent years mainly adopt convolutional neural networks. These methods achieve higher F-scores, and some of them even surpass humans on benchmarks such as the BSDS500 dataset. HED [9] presents the first end-to-end edge detection architecture, which is built on a fully convolutional VGG-16 network. It generates edge maps by fusing five side-output features with different scales and constructs a weighted cross-entropy loss function to address the problem of imbalanced distribution. Based on HED, RCF [10] further utilizes multi-scale features, aggregating all features from different convolutional layers in each stage of VGG-16, improving the ability of the network to capture contextual information, and becoming the first to outperform humans on BSDS500. BDCN [14] proposes a bi-directional cascade network architecture, which leverages information flow from shallow-to-deep and deep-to-shallow to capture multi-scale features within images comprehensively, significantly enhancing the performance. DexiNed [30], inspired by HED and Xception [31], can produce detailed edge maps that are visually appealing, without any pre-training or fine-tuning process. PiDiNet [18] provides a lightweight yet effective solution for edge detection by integrating traditional edge detection operators into vanilla convolution in modern DCNNs.

As for precise edge detection, various methods have also been actively explored. CED [32] focuses on guiding the network to predict sharp edge maps by employing sub-pixel convolution to upsample features. LPCB [33] explains the reason for edge thickness and proposes a new loss function based on the Dice coefficient [34], which enables the network to generate crisp boundaries without requiring post-processing. DSCD [35] aims to produce high-quality edge maps; it introduces a novel loss function inspired by SSIM [36] and builds a dense connection [37] hyper-module based on dilated convolution [38] to achieve better performance. DRC [15] presents a network that stacks refinement modules along with an adaptive weighting strategy for the optimal combination of different loss functions, resulting in a satisfactory performance. CATS [39] presents a context-aware tracing strategy, which consists of a novel tracing loss and a context-aware fusion block for crisp edge detection. FCL-Net [40] proposes a novel network architecture designed to enhance the accuracy of edge detection through fine-scale corrective learning mechanisms. EDETR [41] is the first Transformer-based [42] edge detector. It consists of two stages and can extract well-defined object boundaries and meaningful edges by utilizing contextual information from the entire image as well as detailed local cues.

Our approach is motivated by the above pioneering works [33, 19, 17] to address the problem of edge thickness. We make the network produce crisp and clean edge maps by combining multi-scale contextual and second-order derivative information. Additionally, our method addresses the issue of imbalanced pixel distribution by constructing a hybrid focal loss function, resulting in an excellent performance.

III Methodology

In this section, we present the LUS-Net in detail. The whole network can be divided into three parts as shown in Fig. 2: the top-down encoder, the skip-connection, and the bottom-up decoder. The top-down encoder comprises a lightweight pre-trained backbone, which can provide basic features with rich semantic information. The skip-connection consists of several second-order derivative-based multi-scale contextual enhancement modules (SDMCMs), and each SDMCM cascades to the corresponding stage of the backbone. The bottom-up decoder component is built on boundary refinement modules (BRMs), which introduce conditionally parameterized convolution (CondConv) [21] into edge detection for the first time. In particular, the decoder employs a dense connection to fuse different scale features, enhancing the expressive capability of the model. Additionally, our model is supervised using the hybrid focal loss (HFL) function. We provide the details of each component in the following subsections.

Figure 2: The architecture of LUS-Net.

III-A Lightweight pre-trained backbone

Most SOTA edge detection methods adopt a pre-trained VGG [43] or ResNet [44] as a feature extraction backbone and then fine-tune it on the edge detection dataset. However, these backbones, with a large number of parameters, demand an expensive computational cost. To achieve efficient edge detection, we adopt and test three lightweight pre-trained backbones: MobileNetV2 [22], ShuffleNetV2 [23], and EfficientNetV2 [24]. These lightweight backbones have demonstrated comparable performance to VGG or ResNet networks while providing efficient inference capabilities. However, these models employ a stride of 2 in the initial convolutional layer, resulting in excessive downsampling. This leads to feature maps with significantly reduced spatial resolution, damaging the quality of the generated contours. Therefore, we modify this stride from 2 to 1. Additionally, we remove the classifier head to make the backbone network suitable for edge detection, which simultaneously reduces the network parameters.
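To make the effect of the stride change concrete, the following minimal sketch computes the spatial size of the deepest feature map before and after the modification. It assumes a hypothetical backbone with five stride-2 stages built from $3\times 3$ convolutions with padding 1 and a hypothetical 320-pixel input; the real backbones differ in their layer details, so the numbers are only illustrative.

```python
def conv_output_size(size, kernel=3, stride=2, padding=1):
    """Standard convolution output length: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def downsample(size, strides):
    """Apply a sequence of strided 3x3 convolutions and return the final size."""
    for s in strides:
        size = conv_output_size(size, stride=s)
    return size


# Hypothetical backbone with five stride-2 stages (total stride 32).
original = downsample(320, [2, 2, 2, 2, 2])
# Same backbone with the first stride changed from 2 to 1 (total stride 16).
modified = downsample(320, [1, 2, 2, 2, 2])
print(original, modified)
```

Under these assumptions, every stage of the modified backbone keeps twice the spatial resolution of the original one, which is the property the stride change is meant to provide.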

III-B Second-order derivative-based multi-scale contextual enhancement module

Image derivative information is used to locate edge pixels within an image, and the commonly employed forms are the first-order and second-order derivatives. The first-order derivative detects regions of rapid intensity change, indicating edges. The second-order derivative measures the rate of change of the intensity gradient and identifies the exact edge locations through its zero-crossing points. The information provided by the second-order derivative of an image is more precise than that of the first-order derivative because it is more sensitive to abrupt changes in pixel value, which typically correspond to edge information. Given an input image $I$ and a specific pixel location $x_0$ in the image, the first-order derivative of the image in the x-direction at $x_0$ is computed as:

$$\frac{\partial I}{\partial x}\Big|_{x=x_{0}}=I(x_{0}+1)-I(x_{0}) \tag{1}$$

and the formula for the second-order derivative of the image at $x_0$ is:

$$\frac{\partial^{2}I}{\partial x^{2}}\Big|_{x=x_{0}}=I(x_{0}+1)+I(x_{0}-1)-2I(x_{0}) \tag{2}$$
Figure 3: The process of image first-order derivative and second-order derivative.

The process of these two derivatives is shown in Fig. 3. For the first-order derivative, pixel values remain constant and have the same sign in areas of slow variation, changing only at sharp grayscale transitions. In contrast, the second-order derivative is zero in slow variation regions but shows opposite signs at abrupt changes, creating a one-pixel width edge, which is precisely the desired output we require. This accurately locates edge pixels, as the opposite signs enhance edge contrast, making the second-order derivative more effective for precise edge detection.
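The behaviour of the two discrete derivatives can be checked on a small example. The sketch below applies the forward difference of Eq. (1) and the second difference of Eq. (2) to a hypothetical one-dimensional intensity profile (flat, then a ramp edge, then flat); the profile values are illustrative and not taken from our experiments.

```python
def first_derivative(signal):
    """Forward difference I(x+1) - I(x), as in Eq. (1); the last sample is skipped."""
    return [signal[x + 1] - signal[x] for x in range(len(signal) - 1)]


def second_derivative(signal):
    """Second difference I(x+1) + I(x-1) - 2 I(x), as in Eq. (2), for interior samples."""
    return [signal[x + 1] + signal[x - 1] - 2 * signal[x]
            for x in range(1, len(signal) - 1)]


# Hypothetical 1-D intensity profile: flat region, ramp edge, flat region.
profile = [10, 10, 10, 20, 30, 40, 40, 40]

d1 = first_derivative(profile)
d2 = second_derivative(profile)
print(d1)  # [0, 0, 10, 10, 10, 0, 0]
print(d2)  # [0, 10, 0, 0, -10, 0]
```

The first-order response stays non-zero with a constant sign across the whole ramp, whereas the second-order response is non-zero only at the onset and the end of the ramp and carries opposite signs there, so the sign change between them localizes the edge.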

Relying solely on the second-order derivative for edge detection is insufficient because of its sensitivity to noise, which amplifies noise influence despite accurately locating edge pixels. To mitigate this, we propose a second-order derivative-based multi-scale contextual enhancement module (SDMCM). By introducing multi-scale contextual information, the receptive field expands, providing long-range semantic and structural information that helps the network distinguish true edge pixels from noise. True edge pixels are associated with objects or structures, while noise edges lack semantic coherence. Leveraging this information, the network can suppress noise edges and generate crisp edge maps, effectively addressing the thickness issue. The architecture of SDMCM can be seen in Fig. 4, and it can be divided into two parts: the multi-scale contextual path and the second-order derivative path.

Figure 4: The diagram of SDMCM.

In the multi-scale contextual path, we first employ a channel compression operation $\mathcal{C}(\cdot)$, which is a $1\times 1$ convolution, to compress the input feature channels. This operation helps prevent overfitting while simultaneously reducing the number of parameters. The compression ratio is $r\in\left\{\frac{1}{2},\frac{1}{4},\frac{1}{8}\right\}$, and we find that changes in the compression ratio affect the crispness of the predicted edge maps (Section IV provides details). After the channel compression, we construct four parallel branches $\left\{\mathcal{B}_{1}(\cdot),\mathcal{B}_{2}(\cdot),\mathcal{B}_{3}(\cdot),\mathcal{B}_{4}(\cdot)\right\}$ to produce multi-scale contextual features, and each branch consists of several $3\times 3$ dilated Conv-ReLU sequences. Dilated convolution with a fixed dilation rate broadens the receptive field; however, it also samples features at fixed intervals, which leads to blurred and discontinuous edges. Therefore, we stack a series of dilated convolutions with different dilation rates $d\in\left\{1,2,3\right\}$, which fills the intervals in the features. In this way, we utilize all the features under continuous convolution kernels, without leaving unused features or fixed intervals, while capturing long-range semantic information through the expanded receptive field. In the end, we combine the features from the four branches by element-wise summation to obtain mixed features that contain rich multi-scale contextual information.
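The interval-filling argument can be illustrated in one dimension. The sketch below enumerates which input offsets can influence a single output sample after three stacked $3$-tap dilated convolutions, comparing a fixed rate with mixed rates. The stacking order $(1, 2, 3)$ and the depth of three layers are assumptions made for illustration; the exact composition of each branch is given in Fig. 4.

```python
def reachable_offsets(dilations, kernel_size=3):
    """Set of 1-D input offsets that influence one output sample
    after stacking convolutions with the given dilation rates."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        taps = [k * d for k in range(-half, half + 1)]
        offsets = {o + t for o in offsets for t in taps}
    return offsets


def receptive_field(dilations, kernel_size=3):
    """Receptive field width: 1 + sum over layers of (kernel_size - 1) * dilation."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


fixed = reachable_offsets([2, 2, 2])   # same rate in every layer
mixed = reachable_offsets([1, 2, 3])   # assumed stacking order of d in {1, 2, 3}

print(receptive_field([2, 2, 2]), sorted(fixed))
print(receptive_field([1, 2, 3]), sorted(mixed))
```

Both stacks span the same 13-sample receptive field, but the fixed-rate stack only ever touches even offsets, while the mixed-rate stack reaches every offset in the window.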

We employ the original Laplacian template to build the second-order derivative path, which consists of a Laplacian-Batchnorm-ReLU sequence, a $3\times 3$ Conv-Batchnorm sequence, and a $1\times 1$ convolution. Additionally, we introduce a shortcut connection in both paths to more effectively model complex non-linear transformations, thereby enhancing the expressive power of the model. For an input two-dimensional feature map $X\in\mathcal{R}^{H\times W}$, the output feature map $Y\in\mathcal{R}^{H\times W}$ is obtained by

$$Y=\mathcal{S}\left[\mathcal{B}_{1}\left(\mathcal{C}(X)\right)+\mathcal{B}_{2}\left(\mathcal{C}(X)\right)+\mathcal{B}_{3}\left(\mathcal{C}(X)\right)+\mathcal{B}_{4}\left(\mathcal{C}(X)\right)+\mathcal{C}(X)\right] \tag{3}$$

Our newly developed SDMCM significantly enhances the capability of the model to locate the true edge pixels by combining the multi-scale contextual information and the second-order derivative information, consequently making our model generate crisp and precise edge maps.
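As a small illustration of the second-order derivative cue used in this module, the sketch below applies a Laplacian template to a hypothetical patch containing a vertical step edge. It assumes the common 4-neighbour template; the text above only refers to the "original Laplacian template", so the kernel values are an assumption, and the sketch covers only the Laplacian step, not the full SDMCM.

```python
# Assumed 4-neighbour Laplacian template (not specified in the text).
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def apply_kernel(image, kernel):
    """Valid (no padding) 2-D correlation of a nested-list image with a 3x3 kernel."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(1, rows - 1):
        row = []
        for c in range(1, cols - 1):
            acc = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    acc += kernel[dr + 1][dc + 1] * image[r + dr][c + dc]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 5x6 patch with a vertical step edge between columns 2 and 3.
patch = [[0, 0, 0, 9, 9, 9] for _ in range(5)]
response = apply_kernel(patch, LAPLACIAN)
for row in response:
    print(row)  # each row: [0, 9, -9, 0]
```

The response is zero in the flat regions and takes opposite signs on the two sides of the step, so the zero-crossing between them marks the edge position.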

III-C Boundary refinement module

We develop a novel boundary refinement module (BRM) as shown in Fig. 5, which is based on the conditionally parameterized convolution [21] (CondConv) to further refine the boundaries. CondConv can improve the model adaptability without significantly increasing the number of parameters by consolidating multiple convolutions into a single conditionally parameterized convolutional kernel. We leverage the unique characteristics of CondConv to build the BRM. This module can address the issue of illumination variations in an image, which leads to generating spurious edges.

The BRM consists of a residual block and two $3\times 3$ CondConv-Batchnorm-ReLU sequences. The residual block facilitates information flow through its summation operation, and the two $3\times 3$ CondConv-Batchnorm-ReLU sequences further refine the features from the residual block, thereby improving the performance in edge detection. Our BRM can learn the contextual features of image illumination disturbances and modulate the convolutional kernel parameters accordingly, which aids in suppressing the spurious edge responses in these regions, resulting in more refined and precise edge maps. By harnessing the adaptive nature of CondConv, the BRM allows the model to dynamically adjust its convolutional behavior according to the input context, mitigating the impact of interference factors and generating accurate and semantically consistent edge representations.
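The input-dependent kernel mixing behind CondConv [21] can be sketched as follows. This is a simplified illustration, not the BRM itself: it uses $1\times 1$ kernels instead of $3\times 3$, omits Batchnorm, ReLU, and the residual block, and all expert and routing values are hypothetical. It follows the CondConv formulation in which per-expert gates are obtained from a sigmoid over a linear map of the globally average-pooled input, and the expert kernels are summed with these gates before a single convolution is applied.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def global_average_pool(x):
    """x is a list of channels, each a 2-D nested list; returns one mean per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def routing_weights(x, routing_matrix):
    """One sigmoid gate per expert, computed from the pooled input."""
    pooled = global_average_pool(x)
    return [sigmoid(sum(p * w for p, w in zip(pooled, expert_row)))
            for expert_row in routing_matrix]


def cond_conv_1x1(x, experts, routing_matrix):
    """Mix the expert kernels with input-dependent gates, then apply one 1x1 convolution."""
    gates = routing_weights(x, routing_matrix)
    c_out, c_in = len(experts[0]), len(experts[0][0])
    mixed = [[sum(g * e[o][i] for g, e in zip(gates, experts)) for i in range(c_in)]
             for o in range(c_out)]
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(mixed[o][i] * x[i][r][c] for i in range(c_in)) for c in range(w)]
             for r in range(h)] for o in range(c_out)]


# Two hypothetical experts, each mapping 2 input channels to 1 output channel.
experts = [[[1.0, 0.0]], [[0.0, 1.0]]]
x = [[[2.0, 2.0], [2.0, 2.0]], [[4.0, 4.0], [4.0, 4.0]]]

# Zero routing weights: every gate is sigmoid(0) = 0.5, so the experts are averaged.
y_a = cond_conv_1x1(x, experts, [[0.0, 0.0], [0.0, 0.0]])
# Routing that strongly favours the first expert for this input.
y_b = cond_conv_1x1(x, experts, [[5.0, 0.0], [-5.0, 0.0]])
print(y_a[0][0][0], y_b[0][0][0])
```

Because the gates depend on the pooled input, the effective kernel changes from image to image while only one convolution is executed per layer, which is the property the BRM relies on.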

Figure 5: The structure of BRM.

Additionally, we utilize $3\times 3$ depthwise separable convolution coupled with bilinear interpolation to upsample the resolution of feature maps from BRM. This combination enables us to increase the spatial dimensions of the feature maps while preserving important details and spatial relationships. In the end, we employ a dense connection to cascade each BRM, thereby integrating different scale features and allowing the model to fully leverage high-level semantic information to filter out the noise pixels.
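A short calculation shows why a depthwise separable convolution is a lightweight choice for this upsampling step. The sketch below compares weight counts for a standard and a depthwise separable $3\times 3$ convolution; the channel width of 64 is a hypothetical value chosen for illustration, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise convolution followed by a 1x1 pointwise convolution."""
    return k * k * c_in + c_in * c_out


c_in, c_out = 64, 64  # hypothetical channel widths
std = standard_conv_params(c_in, c_out)
dws = depthwise_separable_params(c_in, c_out)
print(std, dws)  # 36864 4672
```

At this width the separable form needs roughly one eighth of the weights of the standard convolution.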

III-D Hybrid focal loss

As a pixel-level binary classification task, edge detection encounters a significant challenge of imbalanced class distribution. The edge pixels typically comprise only about 10% of an image. Such an imbalanced issue causes great difficulties in training the network. Current mainstream methods use the weighted cross-entropy loss to tackle this, but it often results in thick boundaries [33]. To address the issue of imbalanced class distribution, we propose a hybrid focal loss function.

Imbalanced class distribution is a common issue in object detection and image segmentation. To address this, we draw on solutions from these two fields. Specifically, the focal loss [20] in object detection is effective in scenarios with extreme class imbalance and the need for fine detail detection, while the Tversky index [19] in image segmentation suits situations requiring a specific trade-off between false positives and false negatives. These characteristics are crucial for edge detection.

The focal loss aims to tackle the issue of imbalanced data sample distribution, and we leverage its idea to address the problem of imbalanced pixel distribution in an image. In pixel-level binary classification, the standard cross-entropy loss function is used to evaluate the proximity of the predictions to the labels. However, under an imbalanced pixel distribution, the loss is dominated by easily classified pixels from the majority class, neglecting the harder-to-learn minority class. We therefore introduce a focusing parameter $\gamma\geq 0$ into the cross-entropy to down-weight the loss for easy (well-classified) pixels and focus training on hard (misclassified) pixels. This ensures the model pays more attention to the minority class and improves its ability to correctly classify these challenging cases, alleviating the imbalanced pixel distribution. The focal loss can be written as:

$$L_{FL}=-\alpha\sum_{i=1}^{N}\left(\left(1-p_{i}\right)^{\gamma}g_{i}\log p_{i}+p_{i}^{\gamma}\left(1-g_{i}\right)\log\left(1-p_{i}\right)\right) \tag{4}$$

where $p_i$ and $g_i$ represent the values of the $i$-th pixel on a predicted edge map and its corresponding ground-truth image, respectively. $N$ is the total number of pixels in an image. $(1-p_i)^{\gamma}$ is a modulating factor and $\alpha$ is a balance factor for positive and negative pixels.
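A direct transcription of Eq. (4) into code makes the role of the modulating factor visible. In the sketch below, the default values `alpha=0.25` and `gamma=2.0` are the commonly used focal loss settings from [20] and are assumptions here, as is the small clamp `eps` that keeps the logarithm finite; none of these are stated for Eq. (4) in this section.

```python
import math


def focal_loss(pred, gt, alpha=0.25, gamma=2.0, eps=1e-7):
    """Eq. (4): -alpha * sum((1-p)^gamma * g * log p + p^gamma * (1-g) * log(1-p))."""
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite (assumption)
        total += (1 - p) ** gamma * g * math.log(p)
        total += p ** gamma * (1 - g) * math.log(1 - p)
    return -alpha * total


# One edge pixel (g = 1) predicted confidently versus predicted poorly.
easy = focal_loss([0.9], [1])
hard = focal_loss([0.1], [1])
print(easy, hard)
```

The confidently classified pixel contributes far less than the misclassified one, and setting `gamma=0` reduces the expression to an `alpha`-scaled cross-entropy.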

The focal loss function is based on pixel error. When dealing with class imbalance, using such a strategy only will result in correct classification but insufficient accuracy, which leads to generating thick edge maps. To tackle the issue of thickness, we introduce the image-level information and construct a focal Tversky loss function based on the Tversky index. The Tversky index can be written as follows:

$$T\left(P,G;\alpha,\beta\right)=\frac{|PG|}{|PG|+\alpha|P/G|+\beta|G/P|} \tag{5}$$

where $P$ indicates the predicted edge map and $G$ indicates its corresponding label. $\alpha$ and $\beta$ control the weights of false positives (FPs, the predicted edge pixels absent from the label, $|P/G|$) and false negatives (FNs, the labeled edge pixels missed by the prediction, $|G/P|$), respectively. By adjusting the values of $\alpha$ and $\beta$, we can make a trade-off between FP pixels and FN pixels.
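Eq. (5) can be evaluated on binary masks represented as sets of pixel coordinates. The sketch below uses a hypothetical four-pixel ground-truth edge, a thick prediction (all true pixels plus four extra ones), and a broken prediction (half of the true pixels missing); the function assumes at least one true positive, false positive, or false negative so that the denominator is non-zero.

```python
def tversky_index(pred, gt, alpha, beta):
    """Eq. (5) on binary masks given as sets of pixel coordinates."""
    tp = len(pred & gt)   # |PG|: pixels in both prediction and label
    fp = len(pred - gt)   # |P/G|: predicted edge pixels absent from the label
    fn = len(gt - pred)   # |G/P|: labelled edge pixels missed by the prediction
    return tp / (tp + alpha * fp + beta * fn)


gt = {(0, 0), (0, 1), (0, 2), (0, 3)}               # hypothetical one-pixel-wide edge
thick = gt | {(1, 0), (1, 1), (1, 2), (1, 3)}       # edge plus a spurious second row
broken = {(0, 0), (0, 1)}                           # half of the edge is missing

print(tversky_index(thick, gt, 0.3, 0.7))
print(tversky_index(broken, gt, 0.3, 0.7))
print(tversky_index(thick, gt, 0.5, 0.5))
```

With $\alpha=\beta=0.5$ the index coincides with the Dice coefficient, and with $\beta>\alpha$ missing edge pixels lowers the score more than adding the same number of spurious ones.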

Inspired by the focal loss modification of the cross-entropy loss, we adopt a similar strategy and introduce $\gamma$ into the Tversky index. The focal Tversky loss function can be defined as:

$$L_{FT}=\left(\frac{\sum_{i=1}^{N}p_{i}g_{i}+\alpha\sum_{i=1}^{N}\left(p_{i}(1-g_{i})\right)^{2}+\beta\sum_{i=1}^{N}\left((1-p_{i})g_{i}\right)^{2}+C}{\sum_{i=1}^{N}p_{i}g_{i}+C}\right)^{\gamma} \tag{6}$$

where $p_{i}(1-g_{i})$ and $(1-p_{i})g_{i}$ represent FPs and FNs, respectively, and $\alpha+\beta=1$. $\gamma$ is the focusing parameter that balances the easy-to-train non-edge pixels against the hard-to-train edge pixels, and $C$ is a small constant that prevents the numerator and denominator from becoming 0. In this equation, we set $\alpha=0.3$ and $\beta=0.7$, because the FN pixels are more important in edge detection. In addition, we set $\gamma=0.75$, which places greater emphasis on low-accuracy predictions that have been misclassified, and $C=1\times 10^{-7}$. During training, the focal Tversky loss function maximizes the value of $\sum_{i=1}^{N}p_{i}g_{i}$, which represents the TP pixels, and thereby enables the model to locate the true edge pixels more accurately.
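To make Eq. (6) concrete, the following is a minimal sketch in plain Python that evaluates the focal Tversky loss on flattened lists of per-pixel probabilities and binary labels. The function name `focal_tversky_loss` and the list-based interface are illustrative assumptions, not the authors' implementation; a training setup would compute the same quantities on tensors.

```python
def focal_tversky_loss(pred, label, alpha=0.3, beta=0.7, gamma=0.75, c=1e-7):
    """Evaluate Eq. (6) on flattened per-pixel values.

    pred  : list of predicted edge probabilities p_i in [0, 1]
    label : list of groundtruth values g_i in {0, 1}
    The function name and list interface are hypothetical; the default
    hyperparameters are the values stated in the text.
    """
    # Sum of p_i * g_i: soft count of true positive pixels.
    tp = sum(p * g for p, g in zip(pred, label))
    # Squared false positive term, weighted by alpha in Eq. (6).
    fp = sum((p * (1 - g)) ** 2 for p, g in zip(pred, label))
    # Squared false negative term, weighted by beta in Eq. (6).
    fn = sum(((1 - p) * g) ** 2 for p, g in zip(pred, label))
    return ((tp + alpha * fp + beta * fn + c) / (tp + c)) ** gamma


label = [1, 0, 1, 0]
perfect = focal_tversky_loss([1, 0, 1, 0], label)      # no FP, no FN
false_alarm = focal_tversky_loss([1, 1, 1, 0], label)  # one false positive
missed = focal_tversky_loss([0, 0, 1, 0], label)       # one false negative
```

As written, the ratio inside the parentheses is never smaller than 1, so the loss reaches its minimum of 1 only when both error terms vanish; in this toy example the missed edge pixel is penalized more heavily than the false alarm.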

Our hybrid focal loss function is a weighted fusion of the above two loss functions, which can be defined as:

$$L_{HFL}=L_{FT}+\lambda L_{FL} \tag{7}$$

where we set $\lambda=0.001$ to balance the weight between the focal loss and the focal Tversky loss. The hybrid focal loss function constrains the network using both pixel-level and image-level information. By leveraging the focal loss, the model concentrates on pixels that are challenging to classify while reducing the influence of well-classified pixels, thereby alleviating the issue of imbalanced distribution. Simultaneously, by integrating the focal Tversky loss, the model can locate the true positive edge pixels more precisely and generate crisp boundaries by utilizing the global image-level information. Therefore, our hybrid focal loss function can tackle both the class imbalance issue and the edge thickness problem, consequently improving the performance of the model.
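The weighted fusion in Eq. (7) can be sketched as follows. Two things here are assumptions: the focal term uses the standard binary focal loss form of [20] with its commonly used defaults (`alpha=0.25`, `gamma=2.0`), summed over pixels, which may differ from the exact variant used in this paper, and all function names are hypothetical.

```python
import math


def focal_loss(pred, label, alpha=0.25, gamma=2.0, eps=1e-7):
    """Standard binary focal loss summed over pixels (assumed form)."""
    total = 0.0
    for p, g in zip(pred, label):
        p = min(max(p, eps), 1.0 - eps)        # avoid log(0)
        p_t = p if g == 1 else 1.0 - p         # probability of the true class
        alpha_t = alpha if g == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total


def focal_tversky_loss(pred, label, alpha=0.3, beta=0.7, gamma=0.75, c=1e-7):
    """Image-level term following Eq. (6)."""
    tp = sum(p * g for p, g in zip(pred, label))
    fp = sum((p * (1 - g)) ** 2 for p, g in zip(pred, label))
    fn = sum(((1 - p) * g) ** 2 for p, g in zip(pred, label))
    return ((tp + alpha * fp + beta * fn + c) / (tp + c)) ** gamma


def hybrid_focal_loss(pred, label, lam=0.001):
    """Eq. (7): image-level term plus lambda times the pixel-level term."""
    return focal_tversky_loss(pred, label) + lam * focal_loss(pred, label)


label = [1, 0, 1, 0]
good = hybrid_focal_loss([0.9, 0.1, 0.8, 0.2], label)  # mostly correct map
bad = hybrid_focal_loss([0.2, 0.7, 0.3, 0.6], label)   # mostly wrong map
```

The pixel-level term is accumulated per pixel while the image-level term is a single ratio, which is one plausible reading of why a small $\lambda$ is used to balance them.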

IV Experiments

In this section, we provide a comprehensive account of the implementation details, encompassing hyperparameters, the adopted datasets, and their augmentation strategy. We then introduce the two evaluation methods employed in this work, followed by a series of ablation experiments on our approach. Finally, we compare our proposed method with some state-of-the-art (SOTA) edge detection algorithms and demonstrate its superiority.

IV-A Implementation details

Our network is built using the PyTorch deep learning framework [45]. During the training phase, the hyperparameters are as follows: the mini-batch size is 8, the initial learning rate is $1\times 10^{-4}$, the learning rate decay is 0.1, the weight decay is $5\times 10^{-4}$, and the number of training epochs is 40. We decay the learning rate every 5 epochs and adopt the Adam [46] method for optimization. All the experiments are performed using a single Tesla A40 GPU.
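The learning rate schedule described above can be illustrated with a short sketch. It assumes that "decay every 5 epochs" means a step schedule in which the rate is multiplied by 0.1 at each step; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.1, step=5):
    """Step decay: multiply the base rate by `decay` once every `step` epochs.

    Assumes epochs are counted from 0 and that the decay is multiplicative.
    """
    return base_lr * decay ** (epoch // step)


# Learning rate at the start of each 5-epoch stage over the 40 training epochs.
schedule = [learning_rate(epoch) for epoch in range(0, 40, 5)]
```

In PyTorch, a schedule of this shape is usually expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)`; whether the authors used that scheduler is not stated.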

In addition, we test our method on three benchmarks: BSDS500 [12], NYUD-V2 [13], and BIPED [30]. The Berkeley Segmentation Dataset and Benchmarks 500 (BSDS500) is a comprehensive dataset that has been extensively used in computer vision, particularly for evaluating edge detection algorithms. It contains 500 natural images with a standard split of 200 training, 100 validation, and 200 testing images. Each image is annotated by multiple human annotators (usually around 5 to 7), providing a rich set of groundtruth images that represents a wide range of human perception of object boundaries. The NYU Depth Dataset V2 (NYUD-V2) is an influential benchmark that includes depth information and primarily consists of indoor scene images. It comprises 1449 pairs of images, each consisting of an RGB image and a depth image, and is divided into a training subset of 381 images, a validation subset of 414 images, and a testing subset of 654 images. The depth information can be encoded into three channels: horizontal disparity, height above ground, and angle with gravity (HHA). This allows the depth information to be stored in a three-channel RGB image, known as the HHA feature image. The Barcelona Images for Perceptual Edge Detection (BIPED) dataset is a high-quality dataset for evaluating perceptual edge detection algorithms. It contains 250 high-resolution images ($1280\times 720$ pixels) captured in outdoor scenes, divided into a training set of 200 images and a test set of 50 images. During training, we merge the training and validation subsets of BSDS500, and likewise those of NYUD-V2, into a single training set for each dataset. For BIPED, we adopt its original data settings.

As for data augmentation, we follow previous works [9, 10, 33]: the image-label pairs are randomly rotated by 24 angles, cropped, and flipped. All three datasets use the same augmentation strategy.

IV-B Evaluation methods

To evaluate the quality of generated edge maps, we report the F-score $\left(\frac{2\times Precision\times Recall}{Precision+Recall}\right)$, which is widely used in edge detection [9, 10, 33]. Specifically, $Precision=\frac{TP}{TP+FP}$ and $Recall=\frac{TP}{TP+FN}$, where $TP$, $FP$, and $FN$ represent the number of correctly classified edge pixels, the number of incorrectly classified edge pixels, and the number of missed edge pixels, respectively. Once the network has generated edge maps, the first step is to apply a threshold to convert them into binary edge maps, which are then matched to the groundtruth images to calculate the F-score. There are two ways to set this threshold: optimal dataset scale (ODS) and optimal image scale (OIS). The ODS F-score is calculated across the whole dataset by selecting a single fixed threshold that maximizes the F-score over all images collectively. The OIS F-score is calculated by choosing, for each image independently, the threshold that maximizes the F-score of that image.
In addition, we adopt the same evaluation strategy as LPCB [33]: standard evaluation (S-Eval) and crispness evaluation (C-Eval).
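The difference between the two threshold choices can be sketched as follows, assuming the matching step has already produced $TP$, $FP$, and $FN$ counts for every image at every candidate threshold. The function names and the toy counts are hypothetical. Pooling the counts of the per-image best thresholds before computing the OIS F-score follows the common BSDS500 benchmark convention and is an assumption here.

```python
def f_score(tp, fp, fn):
    """F-score from pixel counts, returning 0 when it is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ods_ois(counts):
    """counts[i][t] = (TP, FP, FN) for image i at threshold index t."""
    n_thresholds = len(counts[0])
    # ODS: pool counts over all images per threshold, keep the best threshold.
    ods = max(
        f_score(*[sum(image[t][k] for image in counts) for k in range(3)])
        for t in range(n_thresholds)
    )
    # OIS: pick the best threshold per image, then pool those counts.
    best = [max(image, key=lambda c: f_score(*c)) for image in counts]
    ois = f_score(*[sum(b[k] for b in best) for k in range(3)])
    return ods, ois


# Toy counts for two images at two thresholds (illustrative values only).
counts = [
    [(8, 2, 2), (5, 0, 5)],  # image A prefers the first threshold
    [(4, 6, 1), (4, 1, 1)],  # image B prefers the second threshold
]
ods, ois = ods_ois(counts)
```

Because OIS lets every image use its own best threshold, it is at least as high as ODS on this example, which matches the pattern in the tables of this section.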

Standard evaluation method. In previous studies, the predicted edge maps first undergo standard post-processing (NMS and morphological thinning), and the post-processed edge maps are then matched with their groundtruth images to calculate the ODS and OIS. This standard evaluation method is used to determine the correctness of predicted edge maps.

Crispness evaluation method. Our purpose is to make the network generate clearer and crisper edge maps. Therefore, we follow the settings in LPCB [33] to evaluate the crispness of our results, calculating the ODS and OIS without the standard post-processing.

IV-C Ablation experiments on BSDS500 dataset

To evaluate the performance of each sub-component in our model, we conduct a series of ablation experiments on the BSDS500 dataset and provide a detailed analysis. We report S-Eval and C-Eval for the performance evaluation, and all these ablation experiments are conducted using MobileNetV2 as the encoder.

Loss function: Firstly, we perform ablation experiments on our hybrid loss function to demonstrate its effectiveness. To identify the optimal fusion of focal loss and focal Tversky loss, we conduct a series of comparison experiments with different values of $\lambda$, and the comparison results are shown in Table I. It can be observed that, as $\lambda$ decreases, there is an increase in C-Eval; however, S-Eval does not exhibit a corresponding increase. This is because the influence of focal loss decreases. Therefore, we determine that the optimal value of $\lambda$ is 0.001. Additionally, with $\lambda$ fixed, the best results of both S-Eval and C-Eval are obtained at $\beta=0.7$, so the remaining ablation experiments adopt this setting.

TABLE I: Comparison experiment results for different values of the hyperparameters $\beta$ and $\lambda$ in hybrid focal loss.

| $\beta$ | $\lambda$ | S-Eval ODS | S-Eval OIS | C-Eval ODS | C-Eval OIS |
|---|---|---|---|---|---|
| 0.7 | 1 | 0.805 | 0.825 | 0.686 | 0.694 |
| 0.7 | 0.1 | 0.807 | 0.827 | 0.685 | 0.694 |
| 0.7 | 0.01 | 0.809 | 0.828 | 0.691 | 0.696 |
| 0.7 | 0.001 | 0.805 | 0.827 | 0.698 | 0.705 |
| 0.6 | 0.001 | 0.802 | 0.822 | 0.685 | 0.691 |
| 0.8 | 0.001 | 0.803 | 0.824 | 0.689 | 0.694 |

Network components: The second ablation experiment aims to assess the effectiveness of each proposed component within our model. We first compare the weighted cross-entropy loss with our hybrid loss by training our model using each of these two loss functions separately. As shown in Table III, our hybrid loss function obtains a higher F-score than the weighted cross-entropy loss in the C-Eval evaluation, which demonstrates the effectiveness of the proposed loss function. When individual modules are removed, both S-Eval and C-Eval decline to varying degrees. It should be noted that the performance decreases significantly when BRM is excluded, further demonstrating the contribution of BRM. In addition, we substitute the Laplacian operator in SDMCM with the first-order derivative-based Sobel and Scharr operators. The comparative results, presented in Table II, show that both alternatives degrade performance. This supports our view that second-order derivative information is better than first-order derivative information. On the other hand, our weighted cross-entropy-based model performs better than other models with the same loss function, such as HED and RCF, while having a lower computational cost. These experiment results demonstrate the validity of each component.

TABLE II: A comparative analysis of different derivative information.

| Methods | S-Eval ODS | S-Eval OIS | C-Eval ODS | C-Eval OIS |
|---|---|---|---|---|
| Sobel | 0.803 | 0.824 | 0.690 | 0.697 |
| Scharr | 0.803 | 0.822 | 0.687 | 0.693 |
| Laplacian | 0.805 | 0.827 | 0.698 | 0.705 |

TABLE III: Ablation experiment results of each network component. WCE indicates the weighted cross-entropy loss function. HFL indicates the hybrid focal loss function. SDMCM indicates the second-order derivative-based multi-scale contextual enhancement module. BRM indicates the boundary refinement module. DC indicates the dense connection.
Methods WCE HFL SDMCM BRM DC S-Eval C-Eval
ODS OIS ODS OIS
Ours-MobileNetV2 0.812 0.833 0.693 0.702
0.805 0.827 0.698 0.705
0.803 0.822 0.686 0.692
0.794 0.813 0.647 0.654
0.804 0.825 0.687 0.693

Compression ratio analysis: We explore the effect of different channel compression ratios in SDMCM on the model performance, and the experiment results are shown in Fig. 6. As the ratio increases, the number of model parameters also increases, resulting in a higher computational cost. However, the performance does not improve with an increasing compression ratio. The best performance is obtained at $r=\frac{1}{4}$. The reason for this phenomenon is that a higher compression ratio causes overfitting, resulting in lower performance in both S-Eval and C-Eval. Therefore, we set the compression ratio to $\frac{1}{4}$, which achieves a balance between performance and parameters.

Figure 6: The effect of different compression ratios on network performance. From top to bottom: (a) is S-Eval and (b) is C-Eval. The unit of measurement for "Parameters" is megabytes (M).

IV-D Comparison to some other SOTA methods

In this subsection, we compare our algorithm with current SOTA methods by conducting an experiment using three datasets: BSDS500 [12], NYUD-V2 [13], and BIPED [30].

BSDS500: Firstly, we compare our method with some top-performing algorithms on BSDS500. We select a few recent state-of-the-art edge detectors, which can be divided into two categories: the first category is methods without deep learning, which includes Canny [27], gPb-UCM [12], and SE [29]; the second category is approaches using deep learning techniques, which includes DeepContour [47], DeepEdge [48], HED [9], RCF [10], BDCN [14], CED [32], LPCB [33], DRC [15], PiDiNet [18], EDTER [41], and CHRNet [11]. Following other researchers, we additionally use extra training data from the PASCAL VOC Context dataset [49] and adopt multi-scale testing. Table IV presents the ODS and OIS F-scores in S-Eval and C-Eval. Fig. 7 shows some qualitative comparison results, and Fig. 8 draws the Precision-Recall curves.

Figure 7: Some examples from state-of-the-art (SOTA) methods on BSDS500. From left to right: (a) is the original image, (b) is the groundtruth picture, (c), (d), (e), and (f) are the predictions of HED, RCF, CED, and DRC, respectively. (g) is the output edge map from Ours-EfficientNetV2.

As shown in Fig. 7, HED and RCF are two classical edge detection algorithms built on the VGG-16 [43] architecture, but their edge maps are noisy and blurred. CED and DRC share the same goal as ours, which is to generate crisp edge maps without any post-processing. However, compared to these two methods, our method generates higher-quality and crisper edge maps. Specifically, in the third row of Fig. 7, the contours of the man's eyes and nose are noisy; our method filters out these pixels, whereas DRC incorrectly classifies them. Similar results can be observed in other examples.

TABLE IV: Quantitative evaluation of comparison on BSDS500 dataset. VOC indicates training with extra PASCAL VOC Context data. MS indicates the multi-scale testing.

| Methods | S-Eval ODS | S-Eval OIS | C-Eval ODS | C-Eval OIS |
|---|---|---|---|---|
| Canny [27] | 0.611 | 0.676 | - | - |
| gPb-UCM [12] | 0.729 | 0.755 | - | - |
| SE [29] | 0.743 | 0.764 | - | - |
| DeepEdge [48] | 0.753 | 0.772 | - | - |
| DeepContour [47] | 0.757 | 0.776 | - | - |
| HED [9] | 0.788 | 0.808 | 0.576 | 0.591 |
| DRC [15] | 0.802 | 0.818 | 0.697 | 0.705 |
| CED [32] | 0.794 | 0.811 | 0.642 | 0.656 |
| RCF [10] | 0.798 | 0.815 | 0.585 | 0.604 |
| LPCB [33] | 0.800 | 0.816 | 0.693 | 0.700 |
| BDCN [14] | 0.806 | 0.826 | 0.636 | 0.650 |
| PiDiNet [18] | 0.789 | 0.803 | 0.578 | 0.587 |
| EDTER [41] | 0.824 | 0.841 | 0.698 | 0.706 |
| CHRNet [11] | 0.787 | 0.788 | - | - |
| Ours-ShuffleNetV2 | 0.790 | 0.812 | 0.670 | 0.677 |
| Ours-MobileNetV2 | 0.805 | 0.827 | 0.698 | 0.705 |
| Ours-EfficientNetV2 | 0.826 | 0.846 | 0.720 | 0.726 |
| Ours-ShuffleNetV2-VOC | 0.804 | 0.825 | 0.683 | 0.691 |
| Ours-MobileNetV2-VOC | 0.815 | 0.835 | 0.705 | 0.711 |
| Ours-EfficientNetV2-VOC | 0.827 | 0.846 | 0.717 | 0.723 |
| Ours-ShuffleNetV2-VOC-MS | 0.813 | 0.833 | 0.672 | 0.677 |
| Ours-MobileNetV2-VOC-MS | 0.821 | 0.842 | 0.684 | 0.690 |
| Ours-EfficientNetV2-VOC-MS | 0.829 | 0.851 | 0.694 | 0.700 |

Figure 8: Precision-Recall curves on BSDS500 dataset. Our method achieves the best performance (ODS=0.829).

Table IV shows that our method using ShuffleNetV2 achieves a competitive performance when trained with only the BSDS500 dataset, outperforming HED and RCF in both S-Eval and C-Eval. When MobileNetV2 is used as the encoder, the ODS F-scores improve from 0.790 to 0.805 in S-Eval and from 0.670 to 0.698 in C-Eval. The highest performance is achieved with EfficientNetV2, with ODS=0.826 and OIS=0.846 in S-Eval. These results are better than those of EDTER, the first edge detection algorithm using a Transformer, particularly considering that our model has fewer parameters. In the standard evaluation, when we mix the extra PASCAL VOC Context data into the BSDS500 dataset and adopt multi-scale testing, the ODS and OIS improve further to 0.829 and 0.851, reaching the SOTA performance among all the methods. In the crispness evaluation, compared to CED, LPCB, and DRC, our method using EfficientNetV2 trained with only the BSDS500 dataset shows a significant performance margin, with ODS and OIS values of 0.720 and 0.726, respectively. Specifically, in relative terms, our ODS and OIS are 12.1% and 10.7% higher than those of CED, 3.9% and 3.7% higher than those of LPCB, and 3.3% and 3.0% higher than those of DRC. These quantitative results are consistent with the qualitative visualization. All the comparison results demonstrate that our method has excellent performance: we successfully address the issue of edge thickness while improving the method's performance. Fig. 8 shows that the human performance in edge detection is 0.803. Our results are better than the human level and obtain the best performance among the current SOTA methods.
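The percentage margins quoted for the crispness evaluation are relative improvements over each baseline, and they can be reproduced from the C-Eval columns of Table IV with a short calculation. The helper name `relative_gain` is hypothetical.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0


# C-Eval (ODS, OIS) values taken from Table IV.
ours = (0.720, 0.726)  # Ours-EfficientNetV2
baselines = {
    "CED": (0.642, 0.656),
    "LPCB": (0.693, 0.700),
    "DRC": (0.697, 0.705),
}

gains = {
    name: tuple(round(relative_gain(o, b), 1) for o, b in zip(ours, scores))
    for name, scores in baselines.items()
}
```

For example, the ODS margin over CED is $(0.720-0.642)/0.642$, which is roughly 12.1%, rather than the absolute difference of 7.8 points.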

NYUD-V2: Secondly, we select the NYUD-V2 dataset to conduct another set of comparison experiments. As before, the compared methods consist of algorithms without deep learning, such as gPb-UCM [12], OEF [50], gPb+NG [51], SE [29], and SE+NG+ [52], and recent top edge detectors based on deep learning, such as HED [9], RCF [10], BDCN [14], DRC [15], and CHRNet [11].

Figure 9: Precision-Recall curves on NYUD-V2 dataset. Our method trained on the RGB data and the HHA data achieves a top performance (ODS=0.768).
Figure 10: Some examples from state-of-the-art (SOTA) methods on NYUD-V2. From left to right: (a) is the RGB image, (b) is the HHA image, (c) is the groundtruth picture, (d) is the prediction of DRC, and (e) is the generated edge map of our method. Both the predictions of DRC and our method are only on the RGB image.

Since the NYUD-V2 dataset consists of two types of images, RGB images and HHA images, we train and test our model on three versions following previous works: (a) RGB images only (Ours-RGB); (b) HHA images only (Ours-HHA); (c) directly averaging the predictions from the RGB version and the HHA version (Ours-RGB-HHA). Some predicted examples are shown in Fig. 10. The quantitative evaluation results are summarized in Table V and the Precision-Recall curves are drawn in Fig. 9.
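The RGB-HHA version in (c) is a late fusion of two edge maps. A minimal sketch, assuming both maps have the same size and hold per-pixel edge probabilities, is given below; the function name `average_predictions` and the toy values are hypothetical.

```python
def average_predictions(rgb_map, hha_map):
    """Pixel-wise mean of two equally sized edge probability maps."""
    return [
        [(rgb + hha) / 2.0 for rgb, hha in zip(rgb_row, hha_row)]
        for rgb_row, hha_row in zip(rgb_map, hha_map)
    ]


# Toy 2x2 probability maps (illustrative values only).
rgb_prediction = [[0.8, 0.2], [0.0, 1.0]]
hha_prediction = [[0.6, 0.4], [0.2, 1.0]]
fused = average_predictions(rgb_prediction, hha_prediction)
```

No additional parameters are learned in this step, so the fused result depends only on the two separately trained models.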

TABLE V: Quantitative evaluation values of comparison on NYUD-V2 dataset. RGB indicates RGB images. HHA indicates HHA images. RGB-HHA indicates averaging the prediction of RGB images and HHA images. Ours indicates our method based on EfficientNetV2.

| Methods | S-Eval ODS | S-Eval OIS | C-Eval ODS | C-Eval OIS |
|---|---|---|---|---|
| gPb-UCM [12] | 0.631 | 0.661 | - | - |
| OEF [50] | 0.651 | 0.667 | - | - |
| gPb+NG [51] | 0.687 | 0.716 | - | - |
| SE [29] | 0.695 | 0.708 | - | - |
| SE+NG+ [52] | 0.706 | 0.734 | - | - |
| HED-RGB [9] | 0.722 | 0.737 | 0.387 | 0.404 |
| HED-HHA [9] | 0.691 | 0.704 | 0.335 | 0.350 |
| HED-RGB-HHA [9] | 0.746 | 0.764 | 0.368 | 0.384 |
| RCF-RGB [10] | 0.745 | 0.749 | 0.395 | 0.412 |
| RCF-HHA [10] | 0.701 | 0.702 | 0.333 | 0.348 |
| RCF-RGB-HHA [10] | 0.764 | 0.778 | 0.374 | 0.397 |
| BDCN-RGB [14] | 0.728 | 0.762 | 0.414 | 0.439 |
| BDCN-HHA [14] | 0.704 | 0.716 | 0.347 | 0.367 |
| BDCN-RGB-HHA [14] | 0.766 | 0.779 | 0.375 | 0.392 |
| DRC-RGB [15] | 0.749 | 0.762 | 0.411 | 0.455 |
| DRC-HHA [15] | 0.711 | 0.722 | 0.370 | 0.382 |
| DRC-RGB-HHA [15] | 0.769 | 0.782 | 0.403 | 0.436 |
| CHRNet-RGB [11] | 0.729 | 0.745 | - | - |
| CHRNet-HHA [11] | 0.718 | 0.731 | - | - |
| CHRNet-RGB-HHA [11] | 0.750 | 0.774 | - | - |
| Ours-RGB | 0.757 | 0.768 | 0.546 | 0.559 |
| Ours-HHA | 0.717 | 0.727 | 0.435 | 0.450 |
| Ours-RGB-HHA | 0.768 | 0.780 | 0.513 | 0.524 |

As shown in Fig. 10, (d) and (e) are the output edge maps from DRC-RGB and Ours-RGB, respectively. The edge maps generated by our method are consistent in quality with our predictions on BSDS500. They are cleaner and crisper than those of the top edge detector DRC, which demonstrates the effectiveness of our method.

Table V illustrates that our method achieves impressive performance when trained only on RGB images or only on HHA images. When we average the RGB version and HHA version, there is a noticeable improvement in S-Eval: the ODS reaches 0.768 and the OIS reaches 0.780, which is at the SOTA level. In the crispness evaluation, our method obtains the best performance among the compared methods. More precisely, in relative terms, Ours-RGB is 32.8% and 22.9% higher than DRC-RGB in ODS and OIS, and 31.9% and 27.3% higher than BDCN-RGB in ODS and OIS. Such a substantial increment demonstrates the effectiveness of our method in terms of edge crispness on the NYUD-V2 dataset. The Precision-Recall curves in Fig. 9 show that our method using EfficientNetV2 as the encoder has a top performance.

TABLE VI: Quantitative comparison results on BIPED dataset. MS refers to multi-scale testing. Ours refers to our method based on EfficientNetV2.

| Methods | ODS | OIS |
|---|---|---|
| RCF | 0.849 | 0.861 |
| BDCN | 0.890 | 0.899 |
| CATS | 0.887 | 0.892 |
| DexiNed | 0.895 | 0.900 |
| Ours | 0.902 | 0.908 |
| Ours-MS | 0.903 | 0.909 |

BIPED: Finally, we further evaluate our method on the BIPED dataset. Since the resolution of each image is relatively high ($1280\times 720$), we randomly crop each image-label pair to $320\times 320$ for training. Few methods have been evaluated on the BIPED dataset because it was proposed only recently. Therefore, we adopt four deep learning methods for comparison: RCF [10], BDCN [14], CATS [39], and DexiNed [30]. The quantitative comparison results are listed in Table VI, and they are similar to the performance shown on BSDS500 and NYUD-V2. Our method obtains the highest performance on this benchmark. It is the first method to exceed 0.9 (ODS=0.903, OIS=0.909), demonstrating the robustness of our method.
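The random cropping used for BIPED training has to apply the same window to the image and to its label so that the pair stays aligned. A minimal sketch on nested lists is shown below; the function name `random_crop_pair` and its arguments are hypothetical, and a real pipeline would operate on image arrays or tensors instead.

```python
import random


def random_crop_pair(image, label, size=320, rng=random):
    """Crop the same random size x size window from an image and its label.

    `image` and `label` are nested lists indexed as [row][column] with
    identical height and width, both at least `size`.
    """
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)    # inclusive bounds
    left = rng.randint(0, width - size)

    def crop(grid):
        return [row[left:left + size] for row in grid[top:top + size]]

    return crop(image), crop(label)


# Toy 720 x 1280 pair whose values encode their own coordinates.
image = [[(y, x) for x in range(1280)] for y in range(720)]
label = [[y * 1280 + x for x in range(1280)] for y in range(720)]
image_crop, label_crop = random_crop_pair(image, label, rng=random.Random(0))
```

Encoding the coordinates in the toy data makes it easy to check that a pixel in the cropped image and the pixel at the same position in the cropped label come from the same original location.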

V Conclusion

In this work, we propose a simple yet effective U-shape network named LUS-Net for crisp edge detection. We leverage the second-order derivative information to help the model locate true edge pixels more accurately. In addition, we construct a novel hybrid focal loss function to alleviate the issue of imbalanced pixel distribution. We address the issue of edge thickness, and our method can generate crisp and clean contours without any post-processing. The experiment results show that we achieve state-of-the-art performance on three standard benchmarks, which demonstrates the advantages and effectiveness of our method. However, there is still room for improvement. Our method still relies on backbones pre-trained on ImageNet, which results in a cumbersome training process. Therefore, we will focus on exploring how to train the network from scratch in the future.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (No.2022YFC3006302).

References

  • [1] Andrea Manno-Kovacs. Direction selective contour detection for salient objects. IEEE Transactions on Circuits and Systems for Video Technology, 29(2):375–389, 2019.
  • [2] Aojun Gong, Junfei Nie, Chen Niu, Yuan Yu, Jun Li, and Lianbo Guo. Edge and skeleton guidance network for salient object detection in optical remote sensing images. IEEE Transactions on Circuits and Systems for Video Technology, 33(12):7109–7120, 2023.
  • [3] Zhengzheng Tu, Yan Ma, Chenglong Li, ** Tang, and Bin Luo. Edge-guided non-local fully convolutional network for salient object detection. IEEE Transactions on Circuits and Systems for Video Technology, 31(2):582–593, 2021.
  • [4] Shilpa Rani, Deepika Ghai, and Sandeep Kumar. Object detection and recognition using contour based edge detection and fast r-cnn. Multimedia Tools and Applications, 81(29):42183–42207, 2022.
  • [5] Qihan Feng, Zhiwen Shao, and Zhixiao Wang. Boundary-aware small object detection with attention and interaction. The Visual Computer, pages 1–14, 2023.
  • [6] Yuxin Sun, Li Su, Yongkang Luo, Hao Meng, Zhi Zhang, Wen Zhang, and Shouzheng Yuan. Irdclnet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes. IEEE Transactions on Circuits and Systems for Video Technology, 32(9):6029–6043, 2022.
  • [7] Hao Feng, Keyi Zhou, Wengang Zhou, Yufei Yin, Jiajun Deng, Qi Sun, and Houqiang Li. Recurrent generic contour-based instance segmentation with progressive learning. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2024.
  • [8] Rui Gu, Lituan Wang, and Lei Zhang. De-net: A deep edge network with boundary information for automatic skin lesion segmentation. Neurocomputing, 468:71–84, 2022.
  • [9] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
  • [10] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3000–3009, 2017.
  • [11] Omar Elharrouss, Youssef Hmamouche, Assia Kamal Idrissi, Btissam El Khamlichi, and Amal El Fallah-Seghrouchni. Refined edge detection with cascaded and high-resolution convolutional network. Pattern Recognition, 138:109361, 2023.
  • [12] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
  • [13] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pages 746–760. Springer, 2012.
  • [14] Jianzhong He, Shiliang Zhang, Ming Yang, Yanhu Shan, and Tiejun Huang. Bi-directional cascade network for perceptual edge detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3828–3837, 2019.
  • [15] Y. J. Cao, C. Lin, and Y. J. Li. Learning crisp boundaries using deep refinement network and adaptive weighting loss. IEEE Transactions on Multimedia, 23:761–771, 2021.
  • [16] Zhong Qu, Sheng-Ye Wang, Ling Liu, and Dong-Yang Zhou. Visual cross-image fusion using deep neural networks for image edge detection. IEEE Access, 7:57604–57615, 2019.
  • [17] R Jain. Machine vision. McGraw-Hill, 2:323–331, 1995.
  • [18] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5117–5127, 2021.
  • [19] Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Tversky loss function for image segmentation using 3d fully convolutional deep networks. In International workshop on machine learning in medical imaging, pages 379–387. Springer, 2017.
  • [20] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [21] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. Condconv: Conditionally parameterized convolutions for efficient inference. Advances in neural information processing systems, 32, 2019.
  • [22] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • [23] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
  • [24] Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In International conference on machine learning, pages 10096–10106. PMLR, 2021.
  • [25] Lawrence Gilman Roberts. Machine perception of three-dimensional solids. Massachusetts Institute of Technology, 2017, 1963.
  • [26] Irwin Edward Sobel. Camera models and machine perception. Stanford University, 1970.
  • [27] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
  • [28] David R Martin, Charless C Fowlkes, and Jitendra Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE transactions on pattern analysis and machine intelligence, 26(5):530–549, 2004.
  • [29] Piotr Dollár and C Lawrence Zitnick. Fast edge detection using structured forests. IEEE transactions on pattern analysis and machine intelligence, 37(8):1558–1570, 2014.
  • [30] Xavier Soria, Angel Sappa, Patricio Humanante, and Arash Akbarinia. Dense extreme inception network for edge detection. Pattern Recognition, 139:109461, 2023.
  • [31] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
  • [32] Yupei Wang, Xin Zhao, Yin Li, and Kaiqi Huang. Deep crisp boundaries: From boundaries to higher-level tasks. IEEE Transactions on Image Processing, 28(3):1285–1298, 2018.
  • [33] Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu. Learning to predict crisp boundaries. In Proceedings of the European Conference on Computer Vision (ECCV), pages 562–578, 2018.
  • [34] Lee R Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
  • [35] Ruoxi Deng and Shengjun Liu. Deep structural contour detection. In Proceedings of the 28th ACM international conference on multimedia, pages 304–312, 2020.
  • [36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [37] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • [38] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [39] Linxi Huan, Nan Xue, Xianwei Zheng, Wei He, Jianya Gong, and Gui-Song Xia. Unmixing convolutional features for crisp edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6602–6609, 2021.
  • [40] Wenjie Xuan, Shaoli Huang, Juhua Liu, and Bo Du. Fcl-net: Towards accurate edge detection via fine-scale corrective learning. Neural Networks, 145:248–259, 2022.
  • [41] Mengyang Pu, Ya** Huang, Yuming Liu, Qingji Guan, and Haibin Ling. Edter: Edge detection with transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1402–1412, 2022.
  • [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [45] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017.
  • [46] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [47] Wei Shen, Xinggang Wang, Yan Wang, Xiang Bai, and Zhijiang Zhang. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3982–3991, 2015.
  • [48] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. DeepEdge: A multi-scale bifurcated deep network for top-down contour detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4380–4389, 2015.
  • [49] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 891–898, 2014.
  • [50] Sam Hallman and Charless C Fowlkes. Oriented edge forests for boundary detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1732–1740, 2015.
  • [51] Saurabh Gupta, Pablo Arbeláez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from RGB-D images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 564–571, 2013.
  • [52] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In European conference on computer vision, pages 345–360. Springer, 2014.
Changsong Liu received the M.S. degree in Electronics and Communication Engineering from Tianjin International Engineering Institute, Tianjin University, China, in 2021. Currently, he is working toward the Ph.D. degree with the School of Microelectronics, Tianjin University. His areas of research are artificial intelligence, computer vision, and machine learning.
Wei Zhang received the Ph.D. degree in Microelectronics and Solid-electronics from Tianjin University, China, in 2002. He is currently a Professor with the School of Microelectronics, Tianjin University. His current research interests include artificial intelligence, data mining, and digital image processing.
Yanyan Liu received the B.S. and M.S. degrees in electrical engineering from Tianjin University, Tianjin, China, in 1999 and 2002, respectively, and the Ph.D. degree in electrical engineering from Nankai University, Tianjin, in 2010. In 2004, she joined the Optoelectronic Thin Film Device and Technology Research Institute, Nankai University, where she is currently an Associate Professor. Her current research interests include artificial intelligence and data mining.
Yimeng Fan received the B.S. degree in Electronic Science and Technology from Hebei University of Technology, China, in 2023. He is currently pursuing the M.S. degree in Electronic Science and Technology at Tianjin University, China. His research interests include deep learning and digital image processing.
Mingyang Li received the M.S. degree in Measuring Technology and Instrument from the School of Precision Instrument and Opto-electronics Engineering, Tianjin University, China, in 2021. Currently, he is working toward the Ph.D. degree with the School of Microelectronics, Tianjin University. His areas of research are artificial intelligence, computer vision, and machine learning.
Wenlin Li received the B.S. degree from the School of Microelectronics, Tianjin University, in 2022. She is currently pursuing the M.S. degree with the School of Microelectronics, Tianjin University. Her areas of research are object segmentation, computer vision, and deep learning.