Learning to utilize image second-order derivative information for crisp edge detection

Changsong Liu1, Wei Zhang1, Yanyan Liu2, Yimeng Fan1, Mingyang Li1, and Wenlin Li1
* Yanyan Liu is the corresponding author.
1School of Microelectronics, Tianjin University, Tianjin, 300072, China
2College of Electronic Information and Optical Engineering, Nankai University, Tianjin, 300072, China

Abstract

Edge detection is a fundamental task in computer vision. It has made great progress with the development of deep convolutional neural networks (DCNNs), some of which have achieved beyond-human-level performance. However, recent top-performing edge detection methods tend to generate thick and noisy edge lines. In this work, we address this problem from two aspects: (1) the lack of prior knowledge regarding image edges, and (2) the issue of imbalanced pixel distribution. We propose a second-order derivative-based multi-scale contextual enhancement module (SDMCM) to help the model locate true edge pixels accurately by introducing edge prior knowledge. We also construct a hybrid focal loss function (HFL) to alleviate the imbalanced distribution issue. In addition, we employ the conditionally parameterized convolution (CondConv) to develop a novel boundary refinement module (BRM), which further refines the final output edge maps. Finally, we propose a U-shape network named LUS-Net, which is based on the SDMCM and BRM, for crisp edge detection. We perform extensive experiments on three standard benchmarks, and the experimental results show that our method predicts crisp and clean edge maps and achieves state-of-the-art performance on the BSDS500 dataset (ODS=0.829), NYUD-V2 dataset (ODS=0.768), and BIPED dataset (ODS=0.903).

Index Terms:

Edge detection, Convolutional neural network, Second-order derivative information, Hybrid focal loss function

I Introduction

Edge detection is a fundamental low-level problem in computer vision and image processing, playing a crucial role in a wide range of applications, from autonomous driving to medical imaging. By identifying the boundaries and structures within images, it serves as the cornerstone for many high-level computer vision tasks, such as salient object detection [1, 2, 3], object recognition [4, 5], and image segmentation [6, 7, 8]. With the rapid development of deep learning techniques, numerous researchers aim to employ deep convolutional neural networks (DCNNs) for edge detection. Recent studies have proposed several remarkable DCNN-based methods, such as HED [9], RCF [10], and CHRNet [11]. These DCNN-based approaches have achieved state-of-the-art performance on a number of benchmarks including BSDS500 [12] and NYUD-V2 [13], which demonstrates the strength of DCNNs.

Figure 1: The crispness of edge maps from HED, Laplacian, and our method. (a) is an image from the BSDS500 dataset. (b) is the prediction of HED, which is thick and blurred. (c) is the output of Laplacian, which can leverage the second-order derivative information to generate crisp boundaries but with more noise. (d) is the output from our method, which is crisp and clean.

However, a major issue remains in edge detection: most modern DCNN-based models [9, 10, 14] are inclined to generate thick and noisy edge maps. Such predictions introduce inaccuracies that compromise the spatial precision of high-level tasks, and a post-processing operation (NMS and morphological thinning) is frequently necessary to obtain clean and precise edge maps. This issue can be decomposed into two subsidiary problems: (1) excessive reliance on DCNN-based autonomous learning, without sufficient prior knowledge of edge information, which is essential for precise edge detection; and (2) a highly imbalanced pixel distribution between edges and non-edges in an image, which presents difficulties in pixel classification.

In this work, we attempt to tackle this central issue by addressing the two subsidiary problems. Our goal is to make the network generate one-pixel-wide edge lines without any post-processing operation. On the one hand, most DCNN-based methods [9, 10, 14, 15, 16] leverage the powerful feature extraction capabilities of deep learning for edge detection. However, these methods neglect prior knowledge regarding image edges, which are typically determined by abrupt changes in pixel values. The image second-order derivative captures such changes and precisely locates edge pixels at its zero-crossing points. Therefore, leveraging this characteristic can help DCNNs locate the true edge pixels more accurately. However, merely introducing second-order derivative information can result in noisy edge maps, as illustrated in Fig. 1 (c). In response, we combine the second-order derivative information with multi-scale contextual information, developing a second-order derivative-based multi-scale contextual enhancement module (SDMCM). The SDMCM is based on the traditional Laplacian operator [17], which provides image second-order derivative cues, and on dilated convolution with a large receptive field, which captures long-range multi-scale contextual information that can filter out the noise. The proposed SDMCM enables the network to attain higher precision in true edge pixel localization, consequently generating crisp and clean edge maps.

On the other hand, most DCNN-based approaches employ a weighted cross-entropy loss function [9, 10, 14, 18] to solve the issue of imbalanced pixel distribution, but such a strategy struggles to distinguish between true edge and false edge pixels, so that false positive (FP) pixels are predicted as if they were true positive (TP) pixels; this also produces predictions with thick and noisy edges. Therefore, we propose a novel hybrid focal loss (HFL) to control FPs and false negatives (FNs), which can effectively reduce the impact of imbalanced pixel distribution. The HFL is based on the Tversky index [19] and focal loss [20], which together constrain the network with image-level and pixel-level information, enhancing the accuracy of edge and non-edge pixel classification.

Furthermore, we employ the conditionally parameterized convolution (CondConv) [21] to construct a boundary refinement module (BRM). CondConv can dynamically adjust convolutional kernel parameters according to the contextual information of input feature maps, and the BRM exploits this characteristic to improve the model's adaptability to different image scenarios, thereby further suppressing interference and refining the predicted edge maps. Finally, we propose LUS-Net, which is based on the U-shape architecture, for crisp edge detection. The whole network can be split into three parts: encoder, skip-connections, and decoder. The encoder consists of a lightweight pre-trained model [22, 23, 24] to enable efficient prediction. The SDMCM serves as the skip-connection component. The decoder employs a dense connection to cascade each BRM, allowing the model to learn more diverse and rich feature representations. In addition, we conduct a series of experiments to demonstrate the effectiveness of our method; an example is shown in Fig. 1 (d).

In summary, the main contributions of our work are as follows:

1. We build a second-order derivative-based multi-scale contextual enhancement module (SDMCM), which can help our model locate the true edge pixels more accurately, consequently generating crisp and clean edge maps.

2. We construct a boundary refinement module (BRM) to further refine the edge maps, and propose a U-shape network named LUS-Net, which is based on the SDMCM and BRM, for crisp edge detection.

3. We propose a novel hybrid focal loss (HFL) based on the Tversky index and focal loss, which can effectively alleviate the issue of imbalanced pixel distribution and thereby suppress misclassified false positive pixels near the true positive pixels.

4. We conduct extensive experiments to demonstrate the advantages of our method, and the results show that our method achieves SOTA performance on three benchmark datasets.

The paper is structured as follows. Section II presents a review of related work in edge detection. Section III provides an elaboration of the LUS-Net, which involves SDMCM, BRM, and HFL. Section IV presents an analysis of our experiments. We evaluate the crispness of edges, provide detailed descriptions of the ablation study, analyze the function of each component, and compare our approach with recent state-of-the-art algorithms in edge detection. The final section, Section V, summarizes our proposed method and discusses future directions.

II Related work

Edge detection research spans over four decades, yielding extensive literature. In this section, we review some representative works which are split into two groups: traditional and deep learning-based methods.

Traditional methods: Early edge detection methods typically calculate image derivatives to produce edge maps. The Roberts operator [25] is a simple first-order derivative operator that detects edges by computing differences between diagonally adjacent pixels. The Sobel operator [26] is another first-order edge detector that uses two $3\times 3$ kernels to calculate image gradients in horizontal and vertical directions. The Canny [27] detector is a robust algorithm that detects edges by reducing noise, calculating gradients, and applying non-maximum suppression and hysteresis thresholding. The Laplacian detector [17] locates edge pixels by computing the second-order derivative of the image intensity, highlighting regions of rapid intensity change. Over time, researchers have improved edge detection by integrating texture, gradient, and other low-level features. Methods like Pb [28], gPb [12], and SE [29] use a classifier to generate object-level boundaries by utilizing these features. Despite their enhanced performance over derivative-based methods, they still rely on human-designed features and lack semantic information, limiting further improvement.

Deep learning-based methods: Deep learning-based techniques have significantly advanced the area of edge detection and play a crucial role in most state-of-the-art (SOTA) edge detection methods. SOTA edge detectors in recent years mainly adopt convolutional neural networks. These methods achieve higher F-scores, and some of them even surpass humans on benchmarks such as the BSDS500 dataset. HED [9] presents the first end-to-end edge detection architecture, which is built on a fully convolutional VGG-16 network. It generates edge maps by fusing five side-output features with different scales and constructs a weighted cross-entropy loss function to address the problem of imbalanced distribution. Based on HED, RCF [10] further utilizes multi-scale features, aggregating all features from different convolutional layers in each stage of VGG-16, improving the ability of the network to capture contextual information, and becoming the first to outperform humans on BSDS500. BDCN [14] proposes a bi-directional cascade network architecture, which leverages information flow from shallow-to-deep and deep-to-shallow to capture multi-scale features within images comprehensively, significantly enhancing the performance. DexiNed [30], inspired by HED and Xception [31], can produce detailed edge maps that are visually appealing, without any pre-training or fine-tuning process. PiDiNet [18] provides a lightweight yet effective solution for edge detection by integrating traditional edge detection operators into the vanilla convolution of modern DCNNs.

As for precise edge detection, various methods have also been actively explored. CED [32] focuses on guiding the network to predict sharp edge maps by employing sub-pixel convolution to upsample features. LPCB [33] explains the reason for edge thickness and proposes a new loss function based on the Dice coefficient [34], which enables the network to generate crisp boundaries without requiring post-processing. DSCD [35] aims to produce high-quality edge maps; it introduces a novel loss function inspired by SSIM [36] and builds a dense connection [37] hyper-module based on dilated convolution [38] to achieve better performance. DRC [15] presents a network that stacks refinement modules and uses an adaptive weighting strategy to combine different loss functions, resulting in satisfactory performance. CATS [39] presents a context-aware tracing strategy, which consists of a novel tracing loss and a context-aware fusion block for crisp edge detection. FCL-Net [40] proposes a network architecture designed to enhance the accuracy of edge detection through fine-scale corrective learning mechanisms. EDETR [41] is the first Transformer-based [42] edge detector. It consists of two stages and can extract well-defined object boundaries and meaningful edges by utilizing contextual information from the entire image as well as detailed local cues.

Our approach is motivated by the above pioneering works [33, 19, 17] to address the problem of edge thickness. We make the network produce crisp and clean edge maps by combining multi-scale contextual and second-order derivative information. Additionally, our method addresses the issue of imbalanced pixel distribution by constructing a hybrid focal loss function, resulting in excellent performance.

III Methodology

In this section, we present the LUS-Net in detail. The whole network can be divided into three parts as shown in Fig. 2: the top-down encoder, the skip-connection, and the bottom-up decoder. The top-down encoder comprises a lightweight pre-trained backbone, which can provide basic features with rich semantic information. The skip-connection consists of several second-order derivative-based multi-scale contextual enhancement modules (SDMCMs), and each SDMCM cascades to the corresponding stage of the backbone. The bottom-up decoder component is built on boundary refinement modules (BRMs), which introduce conditionally parameterized convolution (CondConv) [21] into edge detection for the first time. In particular, the decoder employs a dense connection to fuse different scale features, enhancing the expressive capability of the model. Additionally, our model is supervised using the hybrid focal loss (HFL) function. We provide the details of each component in the following subsections.

III-A Lightweight pre-trained backbone

Most SOTA edge detection methods adopt a pre-trained VGG [43] or ResNet [44] as a feature extraction backbone and then fine-tune them on the edge detection dataset. However, these backbones, with a large number of parameters, demand an expensive computational cost. To achieve efficient edge detection, we adopt and test three lightweight pre-trained backbones: MobileNetV2 [22], ShuffleNetV2 [23], and EfficientNetV2 [24]. These lightweight backbones have demonstrated comparable performance to VGG or ResNet networks while providing efficient inference capabilities. Specifically, these models employ a stride of 2 in the initial convolutional layer, resulting in excessive downsampling. This leads to feature maps with significantly reduced spatial resolution, damaging the quality of the generated contours. Therefore, we modify the stride from 2 to 1. Additionally, we remove the classifier head to make the backbone network suitable for edge detection, which simultaneously reduces the network parameters.

III-B Second-order derivative-based multi-scale contextual enhancement module

Image derivative information is used to locate edge pixels within an image, with first-order and second-order derivatives being the most commonly employed. The first-order derivative detects regions of rapid intensity change, indicating edges. The second-order derivative measures the rate of change of the intensity gradient and identifies the exact edge locations through its zero-crossing points. The information provided by the second-order derivative of an image is more precise than that of the first-order derivative because it is more sensitive to abrupt changes in pixel value, which typically correspond to edge information. Given an input image $I$ and a specific pixel location $x_{0}$ in the image, the formula for computing the first-order derivative of the image in the x-direction at $x_{0}$ is:

$$\frac{\partial I}{\partial x}\Big|_{x=x_{0}}=I(x_{0}+1)-I(x_{0}) \tag{1}$$

and the formula for the second-order derivative of the image at $x_{0}$ is:

$$\frac{\partial^{2}I}{\partial x^{2}}\Big|_{x=x_{0}}=I(x_{0}+1)+I(x_{0}-1)-2I(x_{0}) \tag{2}$$

The process of these two derivatives is shown in Fig. 3. For the first-order derivative, pixel values remain constant and have the same sign in areas of slow variation, changing only at sharp grayscale transitions. In contrast, the second-order derivative is zero in slow variation regions but shows opposite signs at abrupt changes, creating a one-pixel width edge, which is precisely the desired output we require. This accurately locates edge pixels, as the opposite signs enhance edge contrast, making the second-order derivative more effective for precise edge detection.
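The behaviour described above can be checked numerically. The following is a minimal sketch that applies Eq. (1) and Eq. (2) to a one-dimensional intensity profile; the profile values and function names are hypothetical and chosen only for illustration.

```python
def first_derivative(signal):
    """Forward difference I(x+1) - I(x), as in Eq. (1)."""
    return [signal[x + 1] - signal[x] for x in range(len(signal) - 1)]


def second_derivative(signal):
    """Second difference I(x+1) + I(x-1) - 2*I(x), as in Eq. (2)."""
    return [signal[x + 1] + signal[x - 1] - 2 * signal[x]
            for x in range(1, len(signal) - 1)]


# Hypothetical 1-D intensity profile: flat, then a ramp, then flat again
profile = [10, 10, 10, 20, 30, 40, 40, 40]

d1 = first_derivative(profile)    # defined at x = 0..6
d2 = second_derivative(profile)   # defined at x = 1..6

print(d1)  # [0, 0, 10, 10, 10, 0, 0]
print(d2)  # [0, 10, 0, 0, -10, 0]
```

The first-order response is non-zero along the whole ramp, whereas the second-order response is non-zero only at the start and the end of the ramp, with opposite signs.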

Relying solely on the second-order derivative for edge detection is insufficient because of its sensitivity to noise, which amplifies noise influence despite accurately locating edge pixels. To mitigate this, we propose a second-order derivative-based multi-scale contextual enhancement module (SDMCM). By introducing multi-scale contextual information, the receptive field expands, providing long-range semantic and structural information that helps the network distinguish true edge pixels from noise. True edge pixels are associated with objects or structures, while noise edges lack semantic coherence. Leveraging this information, the network can suppress noise edges and generate crisp edge maps, effectively addressing the thickness issue. The architecture of SDMCM can be seen in Fig. 4, and it can be divided into two parts: the multi-scale contextual path and the second-order derivative path.

In the multi-scale contextual path, we first employ a channel compression operation $\mathcal{C}(\cdot)$, which is a $1\times 1$ convolution, to compress the input feature channels. This operation helps prevent overfitting while reducing the number of parameters. The compression ratio is $r\in\left\{\frac{1}{2},\frac{1}{4},\frac{1}{8}\right\}$; we find that changes in the compression ratio affect the crispness of the predicted edge maps (Section IV provides details). After the channel compression, we construct four parallel branches $\left\{\mathcal{B}_{1}(\cdot),\mathcal{B}_{2}(\cdot),\mathcal{B}_{3}(\cdot),\mathcal{B}_{4}(\cdot)\right\}$ to produce multi-scale contextual features, and each branch consists of several $3\times 3$ dilated Conv-ReLU sequences. Dilated convolution with a fixed dilation rate broadens the receptive field; however, it also samples features at fixed intervals, which leads to blurred and discontinuous edges. Therefore, we stack a series of dilated convolutions with different dilation rates $d\in\left\{1,2,3\right\}$, which fills the intervals in the features. In this way, the stacked kernels use all the features without leaving unused positions or fixed intervals, while capturing long-range semantic information through the expanded receptive field. Finally, we combine the features from the four branches by element-wise summation to obtain mixed features that contain rich multi-scale contextual information.
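The interval-filling argument can be illustrated in one dimension. The sketch below, written under the simplifying assumption of 3-tap kernels and a hypothetical helper `covered_offsets`, lists which input offsets contribute to a single output position after a stack of dilated convolutions.

```python
def covered_offsets(dilation_rates):
    """Input offsets (1-D) that reach one output position through a stack
    of 3-tap dilated convolutions with the given dilation rates."""
    offsets = {0}
    for d in dilation_rates:
        # Each 3-tap layer with dilation d looks at positions -d, 0, +d
        offsets = {o + k * d for o in offsets for k in (-1, 0, 1)}
    return sorted(offsets)


fixed = covered_offsets([2, 2, 2])   # one fixed rate, repeated
mixed = covered_offsets([1, 2, 3])   # the rates d in {1, 2, 3} from the text

print(fixed)  # [-6, -4, -2, 0, 2, 4, 6]
print(mixed)  # [-6, -5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6]
```

Both stacks span the same 13-position window, but the fixed rate skips every other position, while the mixed rates cover the window densely.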

We employ the original Laplacian template to build the second-order derivative path, which consists of a Laplacian-Batchnorm-ReLU sequence, a $3\times 3$ Conv-Batchnorm sequence, and a $1\times 1$ convolution. Additionally, we introduce a shortcut connection in both paths to model complex non-linear transformations more effectively, thereby enhancing the expressive power of the model. For an input two-dimensional feature map $X\in\mathcal{R}^{H\times W}$, the output feature map $Y\in\mathcal{R}^{H\times W}$ is obtained by

$$Y=\mathcal{S}\left[\mathcal{B}_{1}\left(\mathcal{C}(X)\right)+\mathcal{B}_{2}\left(\mathcal{C}(X)\right)+\mathcal{B}_{3}\left(\mathcal{C}(X)\right)+\mathcal{B}_{4}\left(\mathcal{C}(X)\right)+\mathcal{C}(X)\right] \tag{3}$$

Our newly developed SDMCM significantly enhances the capability of the model to locate the true edge pixels by combining the multi-scale contextual information and the second-order derivative information, consequently enabling our model to generate crisp and precise edge maps.
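To make the role of the Laplacian template concrete, the sketch below applies a $3\times 3$ template to a small image with a vertical step edge. The 4-neighbour template is an assumption (an 8-neighbour variant is also common), the image values are hypothetical, and the Batchnorm, ReLU, and convolution layers of the actual path are omitted.

```python
# 4-neighbour Laplacian template (assumed variant, see lead-in)
LAPLACIAN = [[0, 1, 0],
             [1, -4, 1],
             [0, 1, 0]]


def apply_template(image, kernel):
    """Valid (no padding) 3x3 correlation of a 2-D list with a 3x3 kernel."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            row.append(sum(kernel[j][i] * image[y + j - 1][x + i - 1]
                           for j in range(3) for i in range(3)))
        out.append(row)
    return out


# Hypothetical 5x6 image: dark left half, bright right half
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]
response = apply_template(image, LAPLACIAN)

print(response[0])  # [0, 9, -9, 0]
```

The response is zero in the flat regions and changes sign exactly where the step occurs, which is the zero-crossing that localizes the edge.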

III-C Boundary refinement module

We develop a novel boundary refinement module (BRM), shown in Fig. 5, which is based on the conditionally parameterized convolution (CondConv) [21] and further refines the boundaries. CondConv can improve the model's adaptability without significantly increasing the number of parameters by consolidating multiple convolutions into a single conditionally parameterized convolutional kernel. We leverage this characteristic of CondConv to build the BRM. This module addresses illumination variations in an image, which tend to produce spurious edges.

The BRM consists of a residual block and two $3\times 3$ CondConv-Batchnorm-ReLU sequences. The residual block facilitates information flow through its summation operation, and the two $3\times 3$ CondConv-Batchnorm-ReLU sequences further refine the features from the residual block, thereby improving edge detection performance. Our BRM can learn the contextual features of image illumination disturbances and modulate the convolutional kernel parameters accordingly, which helps suppress spurious edge responses in these regions and yields more refined and precise edge maps. By harnessing the adaptive nature of CondConv, the BRM allows the model to adjust its convolutional behavior dynamically according to the input context, mitigating the impact of interference factors and generating accurate and semantically consistent edge representations.
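The input-dependent kernel of CondConv can be sketched as follows. This is a simplified illustration rather than the BRM itself: it uses a $1\times 1$ kernel instead of $3\times 3$, plain Python lists instead of tensors, and hypothetical names (`condconv_1x1`, `experts`, `routing`). It follows the general CondConv idea of mixing several expert kernels with sigmoid routing weights computed from the globally pooled input.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def condconv_1x1(feature_map, experts, routing):
    """Sketch of a conditionally parameterized 1x1 convolution.

    feature_map: list of C_in channels, each a flat list of pixel values
    experts:     K weight matrices, each of shape C_out x C_in
    routing:     K routing vectors, each of length C_in
    """
    # Global average pooling: one descriptor per input channel
    pooled = [sum(ch) / len(ch) for ch in feature_map]
    # One sigmoid routing weight per expert, computed from the input itself
    r = [sigmoid(sum(w * p for w, p in zip(vec, pooled))) for vec in routing]
    # Mix the expert kernels into a single input-dependent kernel
    c_out, c_in = len(experts[0]), len(experts[0][0])
    kernel = [[sum(r[k] * experts[k][o][i] for k in range(len(experts)))
               for i in range(c_in)] for o in range(c_out)]
    # Apply the mixed kernel as an ordinary 1x1 convolution
    n = len(feature_map[0])
    out = [[sum(kernel[o][i] * feature_map[i][p] for i in range(c_in))
            for p in range(n)] for o in range(c_out)]
    return r, out


# Hypothetical input: 2 channels, 2 pixels each; 2 experts, 1 output channel
features = [[1.0, 3.0], [0.0, 0.0]]
experts = [[[1.0, 0.0]], [[0.0, 1.0]]]
routing = [[0.0, 0.0], [0.0, 0.0]]   # zero routing gives weights of 0.5

weights, output = condconv_1x1(features, experts, routing)
print(weights, output)
```

Because the kernels are mixed before the convolution is applied, only one convolution is computed per input, regardless of the number of experts.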

Additionally, we utilize $3\times 3$ depthwise separable convolution coupled with bilinear interpolation to upsample the resolution of feature maps from BRM. This combination enables us to increase the spatial dimensions of the feature maps while preserving important details and spatial relationships. In the end, we employ a dense connection to cascade each BRM, thereby integrating different scale features and allowing the model to fully leverage high-level semantic information to filter out the noise pixels.

III-D Hybrid focal loss

As a pixel-level binary classification task, edge detection encounters a significant challenge of imbalanced class distribution. The edge pixels typically comprise only about 10% of an image. Such imbalance causes great difficulties in training the network. Current mainstream methods use the weighted cross-entropy loss to tackle this, but it often results in thick boundaries [33]. To address the issue of imbalanced class distribution, we propose a hybrid focal loss function.

Imbalanced class distribution is a common issue in object detection and image segmentation. To address this, we draw on solutions from these two fields. Specifically, the focal loss [20] in object detection is effective in scenarios with extreme class imbalance and a need for fine detail detection, while the Tversky index [19] in image segmentation excels in situations requiring a specific trade-off between false positives and false negatives. These characteristics are crucial for edge detection.

The focal loss aims to tackle the issue of imbalanced data sample distribution, and we leverage its idea to address the problem of imbalanced pixel distribution in an image. In pixel-level binary classification, the standard cross-entropy loss function is used to evaluate the proximity of the predictions to the labels. However, under imbalanced pixel distribution, the loss is dominated by easily classified pixels from the majority class, neglecting the harder-to-learn minority class. We therefore introduce a focusing parameter $\gamma\geq 0$ into cross-entropy to down-weight the loss for easy (well-classified) pixels and focus training on hard (misclassified) pixels. This ensures the model pays more attention to the minority class and improves its ability to correctly classify these challenging cases, alleviating the imbalanced pixel distribution. The focal loss can be written as:

$$L_{FL}=-\alpha\sum_{i=1}^{N}\left(\left(1-p_{i}\right)^{\gamma}g_{i}\log p_{i}+p_{i}^{\gamma}\left(1-g_{i}\right)\log\left(1-p_{i}\right)\right) \tag{4}$$

where $p_{i}$ and $g_{i}$ represent the value of $i$ -th pixel on a predicted edge map and its corresponding groundtruth image, respectively. $N$ is the total number of pixels in an image. $(1-p_{i})^{\gamma}$ is a modulating factor and $\alpha$ is a balance factor for positive and negative pixels.
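A minimal sketch of Eq. (4) is given below to show how the modulating factor separates easy and hard pixels. The values of `alpha`, `gamma`, and the clamping constant `eps` are illustrative assumptions and are not taken from this work.

```python
import math


def focal_loss(pred, gt, alpha=0.25, gamma=2.0, eps=1e-7):
    """Eq. (4) over flat lists of probabilities `pred` and binary labels `gt`.

    alpha, gamma, and eps are illustrative values (assumptions).
    """
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)              # keep log() finite
        total += (1 - p) ** gamma * g * math.log(p)  # edge pixels
        total += p ** gamma * (1 - g) * math.log(1 - p)  # non-edge pixels
    return -alpha * total


# One edge pixel (g = 1), predicted confidently vs. poorly
easy = focal_loss([0.9], [1])   # well classified: strongly down-weighted
hard = focal_loss([0.1], [1])   # misclassified: keeps most of its loss

print(easy, hard)
```

With these illustrative settings, the misclassified pixel contributes a loss several orders of magnitude larger than the well-classified one, which is the intended focusing effect.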

The focal loss function is based on pixel-wise error. When dealing with class imbalance, using this strategy alone may classify pixels correctly yet localize them imprecisely, which leads to thick edge maps. To tackle the issue of thickness, we introduce image-level information and construct a focal Tversky loss function based on the Tversky index. The Tversky index can be written as follows:

$$T\left(P,G;\alpha,\beta\right)=\frac{|PG|}{|PG|+\alpha|P/G|+\beta|G/P|} \tag{5}$$

where $P$ indicates the predicted edge map and $G$ indicates its corresponding label. $\alpha$ and $\beta$ control the weight of false positives (FPs) and false negatives (FNs), respectively. By adjusting the values of $\alpha$ and $\beta$, we can make a trade-off between FP pixels and FN pixels.
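As a worked example of Eq. (5), the sketch below evaluates the index on flat lists, taking $|PG|$ as the sum of $p_{i}g_{i}$, $|P/G|$ as the sum of $p_{i}(1-g_{i})$, and $|G/P|$ as the sum of $(1-p_{i})g_{i}$. The pixel values and the function name are hypothetical, and the example assumes the ground truth contains at least one edge pixel so that the denominator is non-zero.

```python
def tversky_index(pred, gt, alpha, beta):
    """Eq. (5) on flat lists of predictions and binary labels."""
    overlap = sum(p * g for p, g in zip(pred, gt))          # |PG|
    p_not_g = sum(p * (1 - g) for p, g in zip(pred, gt))    # |P/G|
    g_not_p = sum((1 - p) * g for p, g in zip(pred, gt))    # |G/P|
    return overlap / (overlap + alpha * p_not_g + beta * g_not_p)


gt = [1, 1, 0, 0, 0, 0]
thick = [1, 1, 1, 1, 0, 0]    # both edge pixels found, two extra pixels
missed = [1, 0, 0, 0, 0, 0]   # no extra pixels, one edge pixel missed

t_thick = tversky_index(thick, gt, alpha=0.3, beta=0.7)    # 2 / (2 + 0.3 * 2)
t_missed = tversky_index(missed, gt, alpha=0.3, beta=0.7)  # 1 / (1 + 0.7 * 1)

print(t_thick, t_missed)
```

With $\beta>\alpha$, a single missed edge pixel lowers the index more than two extra predicted pixels do, which illustrates how the two weights set the trade-off.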

Inspired by the focal loss modification of the cross-entropy loss, we adopt a similar strategy and introduce $\gamma$ into the Tversky index. The focal Tversky loss function can be defined as:

$$L_{FT}=\left(\frac{\sum_{i=1}^{N}p_{i}g_{i}+\alpha\sum_{i=1}^{N}\left(p_{i}(1-g_{i})\right)^{2}+\beta\sum_{i=1}^{N}\left((1-p_{i})g_{i}\right)^{2}+C}{\sum_{i=1}^{N}p_{i}g_{i}+C}\right)^{\gamma} \tag{6}$$

where $p_{i}(1-g_{i})$ and $(1-p_{i})g_{i}$ represent FPs and FNs, respectively, and $\alpha+\beta=1$. $\gamma$ is the focusing parameter that balances the easy-to-train non-edge pixels against the hard-to-train edge pixels, and $C$ is a small constant that prevents the numerator and denominator from becoming 0. In this equation, we set $\alpha=0.3$ and $\beta=0.7$, because the FN pixels are more important in edge detection. In addition, we set $\gamma=0.75$, which places greater emphasis on low-accuracy predictions that have been misclassified, and $C=1\times 10^{-7}$. During training, the focal Tversky loss function maximizes the value of $\sum_{i=1}^{N}p_{i}g_{i}$, which represents the TP pixels, and thereby enables the model to locate the true edge pixels more accurately.
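Eq. (6) can be transcribed directly, as in the following sketch. The default arguments use the values stated above; the example predictions and the function name are hypothetical.

```python
def focal_tversky_loss(pred, gt, alpha=0.3, beta=0.7, gamma=0.75, c=1e-7):
    """Eq. (6) on flat lists of probabilities `pred` and binary labels `gt`."""
    tp = sum(p * g for p, g in zip(pred, gt))
    fp = sum((p * (1 - g)) ** 2 for p, g in zip(pred, gt))
    fn = sum(((1 - p) * g) ** 2 for p, g in zip(pred, gt))
    return ((tp + alpha * fp + beta * fn + c) / (tp + c)) ** gamma


gt = [1, 0, 1, 0]
perfect = focal_tversky_loss([1, 0, 1, 0], gt)   # no FP, no FN
thick = focal_tversky_loss([1, 1, 1, 0], gt)     # one extra (FP) pixel

print(perfect, thick)
```

In this form the loss reaches its minimum of 1 when there are no FPs or FNs, and it grows as either error term grows relative to the TP term.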

Our hybrid focal loss function is a weighted fusion of the above two loss functions, which can be defined as:

$$L_{HFL}=L_{FT}+\lambda L_{FL} \tag{7}$$

where we set $\lambda=0.001$ to balance the weight between focal loss and focal Tversky loss. The hybrid focal loss function constrains the network with both pixel-level and image-level information. By leveraging the focal loss, the model concentrates on pixels that are challenging to classify while reducing the influence of well-classified pixels, which alleviates the issue of imbalanced distribution. Simultaneously, by integrating the focal Tversky loss, the model can locate the true positive edge pixels more precisely and generate crisp boundaries by utilizing the global image-level information. Therefore, our hybrid focal loss function can tackle both the class imbalance issue and the thickness problem, consequently improving the performance of the model.
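A small arithmetic sketch of Eq. (7) follows. The two loss values are hypothetical and serve only to illustrate one plausible reading of the small $\lambda$: the focal term in Eq. (4) is a sum over all $N$ pixels, whereas the focal Tversky term in Eq. (6) is a ratio that stays close to 1.

```python
def hybrid_focal_loss(l_ft, l_fl, lam=0.001):
    """Eq. (7): focal Tversky term plus the lambda-weighted focal term."""
    return l_ft + lam * l_fl


# Hypothetical per-image loss values (not results from this work)
l_ft = 1.11     # ratio-based term, close to 1 for good predictions
l_fl = 250.0    # summed over all pixels, so it can be far larger

total = hybrid_focal_loss(l_ft, l_fl)
print(round(total, 3))  # 1.36
```

With these illustrative numbers, the weighted focal term contributes 0.25, so neither term dominates the total.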

IV Experiments

In this section, we provide a comprehensive account of the implementation details, encompassing hyperparameters, the adopted datasets, and their augmentation strategy. We then introduce the two evaluation methods employed in this work, followed by a series of ablation experiments on our approach. Finally, we compare our proposed method with some state-of-the-art (SOTA) edge detection algorithms and demonstrate its superiority.

IV-A Implementation details

Our network is built using the PyTorch deep learning framework [45]. During the training phase, the hyperparameters are as follows: the mini-batch size is 8, the initial learning rate is $1\times 10^{-4}$, the learning rate decay is 0.1, the weight decay is $5\times 10^{-4}$, and the number of training epochs is 40. We decay the learning rate every 5 epochs and adopt the Adam [46] method for optimization. All the experiments are performed using a single Tesla A40 GPU.
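The learning rate schedule described above can be sketched as a step decay. The function below assumes 0-based epoch indices and that "decay" means multiplying the learning rate by 0.1 every 5 epochs; this corresponds to what a step scheduler such as PyTorch's `StepLR` with `step_size=5` and `gamma=0.1` would produce, but it is an interpretation rather than the exact training script.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.1, step=5):
    """Step decay: multiply the base rate by `decay` once every `step` epochs."""
    return base_lr * decay ** (epoch // step)


# Learning rate at the start of each 5-epoch block (first three blocks)
schedule = [learning_rate(e) for e in (0, 5, 10)]
print(schedule)
```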

In addition, we test our method on three benchmarks: BSDS500 [12], NYUD-V2 [13], and BIPED [30]. The Berkeley Segmentation Dataset and Benchmarks 500 (BSDS500) is a comprehensive dataset that has been extensively utilized in the field of computer vision, particularly for evaluating edge detection algorithms. This dataset encompasses a diverse collection of 500 natural images, which are further divided into a standard split of 200 training, 100 validation, and 200 testing images. Each image in the BSDS500 is meticulously annotated by multiple human annotators (usually around 5 to 7), providing a rich set of groundtruth images that represents a wide range of human perception in terms of object boundaries. The NYU Depth Dataset V2 (NYUD-V2) is an influential benchmark that includes depth information and primarily consists of indoor scene images. This dataset comprises 1449 pairs of images, with each pair consisting of RGB and depth images. It is divided into three subsets: a training subset containing 381 images, a validation subset containing 414 images, and a testing subset comprising 654 images. The depth information can be encoded into three channels: horizontal disparity, height above ground, and angle with gravity (HHA). This allows for the storage of depth information in a three-channel RGB image, known as the HHA feature image. The Barcelona Images for Perceptual Edge Detection (BIPED) dataset is a high-quality dataset for evaluating perceptual edge detection algorithms. It contains 250 high-resolution images ( $1280\times 720$ pixels) captured in outdoor scenes. The images are divided into a training set of 200 images and a test set of 50 images. During the training process, we merge the training and validation subsets from BSDS500 and NYUD-V2 into a single set, respectively. As for BIPED, we adopt their data settings.

As for data augmentation, we follow previous works [9, 10, 33] and augment the image-label pairs by random cropping, flipping, and rotation over 24 angles. All three datasets employ the same augmentation strategy.

IV-B Evaluation methods

To evaluate the quality of generated edge maps, we report the F-score $\left(\frac{2\times Precision\times Recall}{Precision+Recall}\right)$ , which is widely used in edge detection [9, 10, 33]. Specifically, $Precision=\frac{TP}{TP+FP}$ and $Recall=\frac{TP}{TP+FN}$ , where $TP$ , $FP$ , and $FN$ represent the number of correctly classified edge pixels, the number of incorrectly classified edge pixels, and the number of missed edge pixels, respectively. Once the network has generated edge maps, the primary step is to apply a threshold to convert them into binary edge maps, and then match these to the groundtruth images to calculate the F-score. There are two choices to set this threshold: optimal dataset scale (ODS) and optimal image scale (OIS). ODS F-score is calculated across the whole dataset by selecting a fixed threshold that maximizes the F-score for all images collectively. OIS F-score is calculated for each image independently, choosing the best threshold for that particular image to maximize its F-score. In addition, we adopt the same strategy as LPCB [33] used to evaluate our method: standard evaluation (S-Eval) and crispness evaluation (C-Eval).
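The difference between ODS and OIS can be made concrete with a small sketch. It assumes that the matching between thresholded predictions and ground truth has already produced (TP, FP, FN) counts per image and threshold, and that counts are pooled over images before the F-score is computed; the counts, thresholds, and function names are hypothetical.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def ods_ois(counts):
    """counts[image][threshold] = (TP, FP, FN) after matching to ground truth."""
    thresholds = list(counts[0].keys())
    # ODS: one threshold for the whole dataset, counts pooled over images
    ods = max(f_score(*[sum(img[t][k] for img in counts) for k in range(3)])
              for t in thresholds)
    # OIS: best threshold per image, then pool the counts at those thresholds
    best = [max(img.values(), key=lambda c: f_score(*c)) for img in counts]
    ois = f_score(*[sum(c[k] for c in best) for k in range(3)])
    return ods, ois


# Hypothetical counts for two images at two thresholds
counts = [
    {0.3: (9, 6, 1), 0.6: (8, 1, 2)},   # image 1 is best at 0.6
    {0.3: (8, 2, 2), 0.6: (4, 0, 6)},   # image 2 is best at 0.3
]
ods, ois = ods_ois(counts)
print(round(ods, 3), round(ois, 3))  # 0.756 0.821
```

Because OIS may pick a different threshold for each image, it is never lower than ODS under this pooling scheme.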

Standard evaluation method. In previous studies, the predicted edge maps first undergo standard post-processing (non-maximum suppression (NMS) and morphological thinning), and the post-processed edge maps are then matched with their groundtruth images to calculate the ODS and OIS. This standard evaluation method is used to determine the correctness of predicted edge maps.

Crispness evaluation method. Our purpose is to make the network generate clearer and crisper edge maps. Therefore, we follow the settings in LPCB [33] to evaluate the crispness of our results, which calculates the ODS and OIS without such standard post-processing.

IV-C Ablation experiments on BSDS500 dataset

To evaluate the contribution of each sub-component of our model, we conduct a series of ablation experiments on the BSDS500 dataset and analyze the results in detail. We report both S-Eval and C-Eval, and all ablation experiments use MobileNetV2 as the encoder.

Loss function: Firstly, we perform ablation experiments on our hybrid loss function to demonstrate its effectiveness. To identify the optimal fusion of focal loss and focal Tversky loss, we conduct a series of comparison experiments with different values of $\lambda$, and the results are shown in Table I. As $\lambda$ decreases from 1 to 0.001, the C-Eval scores generally increase (ODS from 0.686 to 0.698), whereas the S-Eval scores do not exhibit a corresponding increase. This is because the influence of the focal loss decreases. Therefore, we set $\lambda$ to 0.001. Additionally, with $\lambda$ fixed at 0.001, the best results in both S-Eval and C-Eval are obtained at $\beta=0.7$, so the remaining ablation experiments adopt this setting.
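To make the roles of $\lambda$ and $\beta$ concrete, the following is a minimal NumPy sketch of one plausible form of such a hybrid loss. It is an illustration under stated assumptions, not the definition used in this paper: it assumes the form $\lambda\cdot$(focal loss) $+$ (focal Tversky loss), assumes that $\beta$ weights false negatives and $1-\beta$ weights false positives in the Tversky index, and uses hypothetical function names and exponent values (`gamma`, `power`).

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss averaged over pixels (gamma is an assumed default)."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    return float(np.mean(-(1 - p_t) ** gamma * np.log(p_t)))

def focal_tversky_loss(p, y, beta=0.7, power=0.75, eps=1e-7):
    """Focal Tversky loss; beta weights false negatives (assumption)."""
    tp = np.sum(p * y)
    fn = np.sum((1 - p) * y)
    fp = np.sum(p * (1 - y))
    tversky = (tp + eps) / (tp + (1 - beta) * fp + beta * fn + eps)
    return float(np.maximum(1 - tversky, 0.0) ** power)

def hybrid_focal_loss(p, y, lam=0.001, beta=0.7):
    """Assumed combination: lam * focal loss + focal Tversky loss."""
    return lam * focal_loss(p, y) + focal_tversky_loss(p, y, beta=beta)
```

Under these assumptions, a smaller `lam` shifts the objective toward the region-based Tversky term, and a `beta` above 0.5 penalizes missed edge pixels more heavily than false alarms, which is one way to counter the scarcity of edge pixels.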

TABLE I: Comparison experiment results for different values of the hyperparameters $\beta$ and $\lambda$ in hybrid focal loss.

$\beta$	$\lambda$	S-Eval ODS	S-Eval OIS	C-Eval ODS	C-Eval OIS
0.7	1	0.805	0.825	0.686	0.694
0.7	0.1	0.807	0.827	0.685	0.694
0.7	0.01	0.809	0.828	0.691	0.696
0.7	0.001	0.805	0.827	0.698	0.705
0.6	0.001	0.802	0.822	0.685	0.691
0.8	0.001	0.803	0.824	0.689	0.694

Network components: The second ablation experiment assesses the effectiveness of each proposed component within our model. We first compare the weighted cross-entropy loss with our hybrid loss by training our model with each of the two loss functions separately. As shown in Table III, our hybrid loss function obtains a higher F-score than the weighted cross-entropy loss in C-Eval (ODS of 0.698 versus 0.693), which demonstrates the effectiveness of the proposed loss function in terms of crispness. When individual modules are removed, both S-Eval and C-Eval decline to varying degrees. Notably, the performance decreases most when BRM is excluded (C-Eval ODS drops from 0.698 to 0.647), which further demonstrates the importance of BRM. In addition, we substitute the Laplacian operator in SDMCM with the first-order derivative-based Sobel and Scharr operators. The comparative results, presented in Table II, show performance degradation under both alternative setups. This supports our view that second-order derivative information is more useful than first-order derivative information for this task. Furthermore, our weighted cross-entropy-based model performs better than other models trained with the same loss function, such as HED and RCF, while having a lower computational cost. These experimental results demonstrate the validity of each component.
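The contrast between first-order and second-order operators can be seen on a toy step edge. The sketch below assumes the common $3\times 3$ horizontal Sobel and Scharr kernels and the 4-neighbour Laplacian kernel; the exact kernels used inside SDMCM may differ, and `filter2d` is a hypothetical helper written only for this illustration.

```python
import numpy as np

# Common 3x3 kernels (assumed here; the module may use other variants)
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SCHARR_X = np.array([[-3, 0, 3], [-10, 0, 10], [-3, 0, 3]], dtype=float)
LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)

def filter2d(img, kernel):
    """Valid-mode 3x3 cross-correlation implemented with shifted slices."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * img[i:i + h - 2, j:j + w - 2]
    return out

# Vertical step edge: three dark columns followed by three bright columns
step = np.tile(np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0]), (5, 1))

first_order = filter2d(step, SOBEL_X)[0]     # broad response on both sides of the edge
second_order = filter2d(step, LAPLACIAN)[0]  # sign change exactly at the edge
```

On this input the Sobel response is equally strong at the two pixels adjacent to the step, while the Laplacian response changes sign between them, so its zero crossing localizes the edge to a single position. This is a small-scale intuition for why second-order information may help produce thinner edges.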

TABLE II: A comparative analysis of different derivative information.

Methods	S-Eval ODS	S-Eval OIS	C-Eval ODS	C-Eval OIS
Sobel	0.803	0.824	0.690	0.697
Scharr	0.803	0.822	0.687	0.693
Laplacian	0.805	0.827	0.698	0.705

TABLE III: Ablation experiment results of each network component. WCE indicates the weighted cross-entropy loss function. HFL indicates the hybrid focal loss function. SDMCM indicates the second-order derivative-based multi-scale contextual enhancement module. BRM indicates the boundary refinement module. DC indicates the dense connection.

Methods	WCE	HFL	SDMCM	BRM	DC	S-Eval ODS	S-Eval OIS	C-Eval ODS	C-Eval OIS
Ours-MobileNetV2	✓	✗	✓	✓	✓	0.812	0.833	0.693	0.702
	✗	✓	✓	✓	✓	0.805	0.827	0.698	0.705
	✗	✓	✗	✓	✓	0.803	0.822	0.686	0.692
	✗	✓	✓	✗	✓	0.794	0.813	0.647	0.654
	✗	✓	✓	✓	✗	0.804	0.825	0.687	0.693

Compression ratio analysis: We explore the effect of different channel compression ratios in SDMCM on the model performance, and the experiment results are shown in Fig. 6. As the ratio increases, the number of model parameters also increases, resulting in a higher computational cost. However, the performance does not keep improving as the compression ratio increases: the best performance is obtained at $r=\frac{1}{4}$. We attribute this to overfitting at higher compression ratios, which lowers the performance in both S-Eval and C-Eval. Therefore, we set the compression ratio to $\frac{1}{4}$, which achieves a balance between performance and parameters.

IV-D Comparison to some other SOTA methods

In this subsection, we compare our algorithm with current SOTA methods through experiments on three datasets: BSDS500 [12], NYUD-V2 [13], and BIPED [30].

BSDS500: Firstly, we compare our method with some top-performing algorithms on BSDS500. We select a number of state-of-the-art edge detectors, which can be divided into two categories: the first category consists of methods without deep learning, including Canny [27], gPb-UCM [12], and SE [29]; the second category consists of deep learning-based approaches, including DeepContour [47], DeepEdge [48], HED [9], RCF [10], BDCN [14], CED [32], LPCB [33], DRC [15], PiDiNet [18], EDTER [41], and CHRNet [11]. Following other researchers, we additionally report results obtained with extra training data sourced from the PASCAL VOC Context dataset [49] and with multi-scale testing. Table IV presents the ODS and OIS F-scores in S-Eval and C-Eval. Fig. 7 shows some qualitative comparison results, and Fig. 8 draws the Precision-Recall curves.

As shown in Fig. 7, HED and RCF are two classical edge detection algorithms based on the VGG-16 [43] architecture, but their edge maps are noisy and blurred. CED and DRC share our goal of generating crisp edge maps without any post-processing. However, compared to these two methods, our method generates higher-quality and crisper edge maps. Specifically, in the third row of Fig. 7, the contours around the man's eyes and nose are noisy; our method filters out these pixels, whereas DRC incorrectly classifies them as edges. Similar results can be observed in other examples.

TABLE IV: Quantitative evaluation of comparison on BSDS500 dataset. VOC indicates training with extra PASCAL VOC Context data. MS indicates the multi-scale testing.

Methods	S-Eval ODS	S-Eval OIS	C-Eval ODS	C-Eval OIS
Canny [27]	0.611	0.676	-	-
gPb-UCM [12]	0.729	0.755	-	-
SE [29]	0.743	0.764	-	-
DeepEdge [48]	0.753	0.772	-	-
DeepContour [47]	0.757	0.776	-	-
HED [9]	0.788	0.808	0.576	0.591
DRC [15]	0.802	0.818	0.697	0.705
CED [32]	0.794	0.811	0.642	0.656
RCF [10]	0.798	0.815	0.585	0.604
LPCB [33]	0.800	0.816	0.693	0.700
BDCN [14]	0.806	0.826	0.636	0.650
PiDiNet [18]	0.789	0.803	0.578	0.587
EDTER [41]	0.824	0.841	0.698	0.706
CHRNet [11]	0.787	0.788	-	-
Ours-ShuffleNetV2	0.790	0.812	0.670	0.677
Ours-MobileNetV2	0.805	0.827	0.698	0.705
Ours-EfficientNetV2	0.826	0.846	0.720	0.726
Ours-ShuffleNetV2-VOC	0.804	0.825	0.683	0.691
Ours-MobileNetV2-VOC	0.815	0.835	0.705	0.711
Ours-EfficientNetV2-VOC	0.827	0.846	0.717	0.723
Ours-ShuffleNetV2-VOC-MS	0.813	0.833	0.672	0.677
Ours-MobileNetV2-VOC-MS	0.821	0.842	0.684	0.690
Ours-EfficientNetV2-VOC-MS	0.829	0.851	0.694	0.700

Table IV shows that our method using ShuffleNetV2 still achieves competitive performance when trained with only the BSDS500 dataset: it outperforms HED in both S-Eval and C-Eval and RCF in C-Eval. When MobileNetV2 is used as the encoder, the ODS F-scores improve from 0.790 to 0.805 in S-Eval and from 0.670 to 0.698 in C-Eval. The highest performance is achieved with EfficientNetV2, with ODS=0.826 and OIS=0.846 in S-Eval. These results are better than those of EDTER, which is the first edge detection algorithm using Transformer, particularly considering that our model has fewer parameters. In the standard evaluation, when we mix the extra PASCAL VOC Context data into the BSDS500 dataset and adopt multi-scale testing, the ODS and OIS improve further to 0.829 and 0.851, the best performance among all the compared methods. In the crispness evaluation, compared to CED, LPCB, and DRC, our method using EfficientNetV2 trained with only the BSDS500 dataset demonstrates a clear performance margin, with ODS and OIS values of 0.720 and 0.726, respectively. Specifically, in relative terms our ODS and OIS are 12.1% and 10.7% higher than those of CED, 3.9% and 3.7% higher than those of LPCB, and 3.3% and 3.0% higher than those of DRC. These quantitative results are consistent with the qualitative visualization. All the comparison results demonstrate that our method addresses the issue of edge thickness while also improving accuracy. Fig. 8 shows that human performance in edge detection is 0.803. Our results are better than the human level and are the best among the compared SOTA methods.
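The percentage margins quoted above are relative improvements rather than absolute differences in F-score. The short sketch below recomputes them from the C-Eval columns of Table IV; the helper name `relative_gain` and the dictionary layout are assumptions made for this illustration.

```python
def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline

# C-Eval (ODS, OIS) values copied from Table IV
ours_efficientnet = (0.720, 0.726)
baselines = {
    "CED": (0.642, 0.656),
    "LPCB": (0.693, 0.700),
    "DRC": (0.697, 0.705),
}

# Map each baseline to its (ODS gain, OIS gain), rounded to one decimal place
gains = {
    name: tuple(round(relative_gain(o, b), 1)
                for o, b in zip(ours_efficientnet, scores))
    for name, scores in baselines.items()
}
# gains["CED"] is (12.1, 10.7), gains["LPCB"] is (3.9, 3.7), gains["DRC"] is (3.3, 3.0)
```

For example, the ODS gain over CED is $(0.720-0.642)/0.642\approx 12.1\%$, whereas the absolute difference is 0.078.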

NYUD-V2: Secondly, we conduct another set of comparison experiments on the NYUD-V2 dataset. As before, the compared methods consist of algorithms without deep learning, such as gPb-UCM [12], OEF [50], gPb+NG [51], SE [29], and SE+NG+ [52], and recent top edge detectors based on deep learning, such as HED [9], RCF [10], BDCN [14], DRC [15], and CHRNet [11].

Since the NYUD-V2 dataset consists of two types of images, RGB images and HHA images, we train and test our model on three versions following previous works: (a) RGB images only (Ours-RGB); (b) HHA images only (Ours-HHA); (c) directly averaging the predictions from the RGB version and the HHA version (Ours-RGB-HHA). Some predicted examples are shown in Fig. 10. The quantitative evaluation results are summarized in Table V and the Precision-Recall curves are drawn in Fig. 9.

TABLE V: Quantitative evaluation values of comparison on NYUD-V2 dataset. RGB indicates RGB images. HHA indicates HHA images. RGB-HHA indicates averaging the prediction of RGB images and HHA images. Ours indicates our method based on EfficientNetV2.

Methods	S-Eval ODS	S-Eval OIS	C-Eval ODS	C-Eval OIS
gPb-UCM [12]	0.631	0.661	-	-
OEF [50]	0.651	0.667	-	-
gPb+NG [51]	0.687	0.716	-	-
SE [29]	0.695	0.708	-	-
SE+NG+ [52]	0.706	0.734	-	-
HED-RGB [9]	0.722	0.737	0.387	0.404
HED-HHA [9]	0.691	0.704	0.335	0.350
HED-RGB-HHA [9]	0.746	0.764	0.368	0.384
RCF-RGB [10]	0.745	0.749	0.395	0.412
RCF-HHA [10]	0.701	0.702	0.333	0.348
RCF-RGB-HHA [10]	0.764	0.778	0.374	0.397
BDCN-RGB [14]	0.728	0.762	0.414	0.439
BDCN-HHA [14]	0.704	0.716	0.347	0.367
BDCN-RGB-HHA [14]	0.766	0.779	0.375	0.392
DRC-RGB [15]	0.749	0.762	0.411	0.455
DRC-HHA [15]	0.711	0.722	0.370	0.382
DRC-RGB-HHA [15]	0.769	0.782	0.403	0.436
CHRNet-RGB [11]	0.729	0.745	-	-
CHRNet-HHA [11]	0.718	0.731	-	-
CHRNet-RGB-HHA [11]	0.750	0.774	-	-
Ours-RGB	0.757	0.768	0.546	0.559
Ours-HHA	0.717	0.727	0.435	0.450
Ours-RGB-HHA	0.768	0.780	0.513	0.524

As shown in Fig. 10, (d) and (e) are the output edge maps from DRC-RGB and Ours-RGB, respectively. The edge maps generated by our method are consistent in quality with our predictions on BSDS500. They are cleaner and crisper than those of the top edge detector DRC, which demonstrates the effectiveness of our method.

Table V illustrates that our method achieves competitive performance when trained only on RGB images or only on HHA images. When we average the RGB version and the HHA version, there is a noticeable improvement in S-Eval: the ODS reaches 0.768 and the OIS reaches 0.780, which is on par with the best competing results (0.769 and 0.782 for DRC-RGB-HHA). In crispness evaluation, our method obtains the best performance among all compared methods. More precisely, in relative terms Ours-RGB is 32.8% and 22.9% higher than DRC-RGB in ODS and OIS, and 31.9% and 27.3% higher than BDCN-RGB in ODS and OIS. Such a substantial increment demonstrates the effectiveness of our method in terms of edge crispness on the NYUD-V2 dataset. The Precision-Recall curves in Fig. 9 show that our method using EfficientNetV2 as the encoder is among the top performers. These comparison results demonstrate the effectiveness of our method.

TABLE VI: Quantitative comparison results on BIPED dataset. MS refers to multi-scale testing. Ours refers to our method based on EfficientNetV2.

Methods	ODS	OIS
RCF	0.849	0.861
BDCN	0.890	0.899
CATS	0.887	0.892
DexiNed	0.895	0.900
Ours	0.902	0.908
Ours-MS	0.903	0.909

BIPED: Finally, we further evaluate our method on the BIPED dataset. Since the resolution of each image is relatively high ($1280\times 720$), we randomly crop each image-label pair to $320\times 320$ for training. Because the BIPED dataset was proposed recently, only a limited number of methods have been evaluated on it. Therefore, we adopt four deep learning-based methods for comparison: RCF [10], BDCN [14], CATS [39], and DexiNed [30]. The quantitative comparison results are listed in Table VI, and they are consistent with the results on BSDS500 and NYUD-V2. Our method obtains the highest performance on this benchmark and is the first of the compared methods to exceed 0.9 in ODS (ODS=0.903, OIS=0.909), demonstrating the robustness of our method.
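The random cropping of image-label pairs mentioned above can be sketched as follows. This is a minimal illustration that assumes NumPy arrays in height-width(-channel) layout; the function name `random_crop_pair` and its parameters are hypothetical and do not describe the actual training code.

```python
import numpy as np

def random_crop_pair(image, label, size=320, rng=None):
    """Crop the same randomly placed size x size window from an image and its label."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = label.shape[:2]
    if h < size or w < size:
        raise ValueError("crop size exceeds image size")
    top = int(rng.integers(0, h - size + 1))    # upper bound is exclusive
    left = int(rng.integers(0, w - size + 1))
    window = (slice(top, top + size), slice(left, left + size))
    # Using one shared window keeps the edge label aligned with the image
    return image[window], label[window]
```

Sampling a single window and applying it to both arrays is the key design point: cropping the image and the label independently would break the pixel-wise correspondence that the loss relies on.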

V Conclusion

In this work, we propose a simple yet effective U-shape network named LUS-Net for crisp edge detection. We leverage the second-order derivative information to help the model locate true edge pixels more accurately. In addition, we construct a novel hybrid focal loss function to alleviate the issue of imbalanced pixel distribution. Our method addresses the issue of edge thickness and generates crisp and clean contours without any post-processing. The experiment results show that we achieve state-of-the-art performance on three standard benchmarks, which demonstrates the advantages and effectiveness of our method. However, there is still room for improvement. We still rely on backbones pre-trained on ImageNet, which results in a cumbersome training process. Therefore, we will focus on exploring how to train the network from scratch in the future.

Acknowledgments

This work is supported by the National Key Research and Development Program of China (No.2022YFC3006302).

References

[1] Andrea Manno-Kovacs. Direction selective contour detection for salient objects. IEEE Transactions on Circuits and Systems for Video Technology, 29(2):375–389, 2019.
[2] Aojun Gong, Junfei Nie, Chen Niu, Yuan Yu, Jun Li, and Lianbo Guo. Edge and skeleton guidance network for salient object detection in optical remote sensing images. IEEE Transactions on Circuits and Systems for Video Technology, 33(12):7109–7120, 2023.
[3] Zhengzheng Tu, Yan Ma, Chenglong Li, ** Tang, and Bin Luo. Edge-guided non-local fully convolutional network for salient object detection. IEEE Transactions on Circuits and Systems for Video Technology, 31(2):582–593, 2021.
[4] Shilpa Rani, Deepika Ghai, and Sandeep Kumar. Object detection and recognition using contour based edge detection and fast r-cnn. Multimedia Tools and Applications, 81(29):42183–42207, 2022.
[5] Qihan Feng, Zhiwen Shao, and Zhixiao Wang. Boundary-aware small object detection with attention and interaction. The Visual Computer, pages 1–14, 2023.
[6] Yuxin Sun, Li Su, Yongkang Luo, Hao Meng, Zhi Zhang, Wen Zhang, and Shouzheng Yuan. Irdclnet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes. IEEE Transactions on Circuits and Systems for Video Technology, 32(9):6029–6043, 2022.
[7] Hao Feng, Keyi Zhou, Wengang Zhou, Yufei Yin, Jiajun Deng, Qi Sun, and Houqiang Li. Recurrent generic contour-based instance segmentation with progressive learning. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2024.
[8] Rui Gu, Lituan Wang, and Lei Zhang. De-net: A deep edge network with boundary information for automatic skin lesion segmentation. Neurocomputing, 468:71–84, 2022.
[9] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. In Proceedings of the IEEE international conference on computer vision, pages 1395–1403, 2015.
[10] Yun Liu, Ming-Ming Cheng, Xiaowei Hu, Kai Wang, and Xiang Bai. Richer convolutional features for edge detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3000–3009, 2017.
[11] Omar Elharrouss, Youssef Hmamouche, Assia Kamal Idrissi, Btissam El Khamlichi, and Amal El Fallah-Seghrouchni. Refined edge detection with cascaded and high-resolution convolutional network. Pattern Recognition, 138:109361, 2023.
[12] Pablo Arbeláez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916, 2011.
[13] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In European conference on computer vision, pages 746–760. Springer, 2012.
[14] Jianzhong He, Shiliang Zhang, Ming Yang, Yanhu Shan, and Tiejun Huang. Bi-directional cascade network for perceptual edge detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3828–3837, 2019.
[15] Y. J. Cao, C. Lin, and Y. J. Li. Learning crisp boundaries using deep refinement network and adaptive weighting loss. IEEE Transactions on Multimedia, 23:761–771, 2021.
[16] Zhong Qu, Sheng-Ye Wang, Ling Liu, and Dong-Yang Zhou. Visual cross-image fusion using deep neural networks for image edge detection. IEEE Access, 7:57604–57615, 2019.
[17] R Jain. Machine vision. McGraw-Hill, 2:323–331, 1995.
[18] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu. Pixel difference networks for efficient edge detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5117–5127, 2021.
[19] Seyed Sadegh Mohseni Salehi, Deniz Erdogmus, and Ali Gholipour. Tversky loss function for image segmentation using 3d fully convolutional deep networks. In International workshop on machine learning in medical imaging, pages 379–387. Springer, 2017.
[20] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
[21] Brandon Yang, Gabriel Bender, Quoc V Le, and Jiquan Ngiam. Condconv: Conditionally parameterized convolutions for efficient inference. Advances in neural information processing systems, 32, 2019.
[22] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
[23] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018.
[24] Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In International conference on machine learning, pages 10096–10106. PMLR, 2021.
[25] Lawrence Gilman Roberts. Machine perception of three-dimensional solids. Massachusetts Institute of Technology, 2017, 1963.
[26] Irwin Edward Sobel. Camera models and machine perception. Stanford University, 1970.
[27] John Canny. A computational approach to edge detection. IEEE Transactions on pattern analysis and machine intelligence, (6):679–698, 1986.
[28] David R Martin, Charless C Fowlkes, and Jitendra Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE transactions on pattern analysis and machine intelligence, 26(5):530–549, 2004.
[29] Piotr Dollár and C Lawrence Zitnick. Fast edge detection using structured forests. IEEE transactions on pattern analysis and machine intelligence, 37(8):1558–1570, 2014.
[30] Xavier Soria, Angel Sappa, Patricio Humanante, and Arash Akbarinia. Dense extreme inception network for edge detection. Pattern Recognition, 139:109461, 2023.
[31] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.
[32] Yupei Wang, Xin Zhao, Yin Li, and Kaiqi Huang. Deep crisp boundaries: From boundaries to higher-level tasks. IEEE Transactions on Image Processing, 28(3):1285–1298, 2018.
[33] Ruoxi Deng, Chunhua Shen, Shengjun Liu, Huibing Wang, and Xinru Liu. Learning to predict crisp boundaries. In Proceedings of the European Conference on Computer Vision (ECCV), pages 562–578, 2018.
[34] Lee R Dice. Measures of the amount of ecologic association between species. Ecology, 26(3):297–302, 1945.
[35] Ruoxi Deng and Shengjun Liu. Deep structural contour detection. In Proceedings of the 28th ACM international conference on multimedia, pages 304–312, 2020.
[36] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
[37] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
[38] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
[39] Linxi Huan, Nan Xue, Xianwei Zheng, Wei He, Jianya Gong, and Gui-Song Xia. Unmixing convolutional features for crisp edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6602–6609, 2021.
[40] Wenjie Xuan, Shaoli Huang, Juhua Liu, and Bo Du. Fcl-net: Towards accurate edge detection via fine-scale corrective learning. Neural Networks, 145:248–259, 2022.
[41] Mengyang Pu, Ya** Huang, Yuming Liu, Qingji Guan, and Haibin Ling. Edter: Edge detection with transformer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1402–1412, 2022.
[42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[43] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[44] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[45] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
[46] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[47] Wei Shen, Xinggang Wang, Yan Wang, Xiang Bai, and Zhijiang Zhang. Deepcontour: A deep convolutional feature learned by positive-sharing loss for contour detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3982–3991, 2015.
[48] Gedas Bertasius, Jianbo Shi, and Lorenzo Torresani. Deepedge: A multi-scale bifurcated deep network for top-down contour detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4380–4389, 2015.
[49] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 891–898, 2014.
[50] Sam Hallman and Charless C Fowlkes. Oriented edge forests for boundary detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1732–1740, 2015.
[51] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from rgb-d images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 564–571, 2013.
[52] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from rgb-d images for object detection and segmentation. In European conference on computer vision, pages 345–360. Springer, 2014.