CT-Bound: Robust Boundary Detection From Noisy Images Via Hybrid Convolution and Transformer Neural Networks

Wei Xu1, Junjie Luo, and Qi Guo*
* Corresponding author
Elmore Family School of Electrical and Computer Engineering
Purdue University, West Lafayette, IN, USA
{xu1639,luo330,qiguo}@purdue.edu

Abstract

We present CT-Bound, a robust and fast boundary detection method for very noisy images using a hybrid Convolution and Transformer neural network. The proposed architecture decomposes boundary estimation into two tasks: local detection and global regularization. During local detection, the model uses a convolutional architecture to predict the boundary structure of each image patch in the form of a pre-defined local boundary representation, the field-of-junctions (FoJ) [9]. Then, it uses a feed-forward transformer architecture to globally refine the boundary structures of all patches and to generate an edge map and a smoothed color map simultaneously. Our quantitative analysis shows that CT-Bound outperforms the previous best algorithms in edge detection on very noisy images. It also increases the edge detection accuracy of FoJ-based methods while running three times faster. Finally, we demonstrate that CT-Bound can produce boundary and color maps on real captured images without extra fine-tuning, and can generate boundary-map and color-map videos in real time at ten frames per second.

Index Terms:

boundary estimation, image denoising, convolutional neural network, transformer

Figure: Boundary detection from noisy images. Compared to a variety of models [1, 2, 3, 4, 5, 6, 7, 8, 9], ours robustly detects the boundaries even when they are visually challenging to discriminate.

I Introduction

Detecting boundary structures from very noisy images is a common and challenging computer vision problem [10]. Many applications require boundary detection from images captured at very low light levels, such as medical imaging, manufacturing, and autonomous navigation. Although image boundary detection has been broadly studied since the early stage of computer vision [11, 12, 13, 6, 14], our results show that the accuracy of the current best boundary detection algorithms is still unsatisfactory when the input images have a very low light level (see the teaser figure at the top of the paper).

We present CT-Bound, a deep neural network architecture that can robustly detect boundaries from a single noisy image. The model processes the input image to predict, for each image patch, a generalized local boundary representation called the field-of-junctions (FoJ) [9]. FoJ can represent a variety of boundary types in image patches, including edges, corners, and contours, and is an effective prior in edge detection, especially for noisy images [9, 1]. By constraining the predicted boundary structures to those that FoJ can describe, we observe that our model can detect very faint edge signals in the presence of significant noise (see the teaser figure at the top of the paper). Our experiments show that CT-Bound achieves the highest boundary detection accuracy among a variety of recent edge detection methods.

CT-Bound consists of a two-stage, hybrid Convolution and Transformer neural network architecture (Fig. 2). The first stage is a convolutional architecture that makes an initial prediction of a local FoJ parameterization based solely on the visual appearance of each image patch. The second stage consists of a feed-forward transformer encoder that takes in the initial FoJ estimations of all image patches and refines them. The architecture is novel in that it completely decomposes boundary estimation into two tasks: detecting boundaries from local image patches, and regularizing neighboring boundary estimations so that they are consistent and resemble natural boundaries. The convolutional network stage conducts boundary detection using a small receptive field ( $21\times 21$ in our experiments). Thus, it does not need to learn the global appearance of images and can be trained using synthetic image patches that only contain basic boundary structures. The transformer stage only receives the FoJ representation and has no access to the input image during inference. Therefore, the computational complexity of the transformer in CT-Bound is significantly lower than that of the classic Vision Transformer [15].

In addition to verifying the accuracy and robustness of the proposed algorithm on noisy images simulated from standard benchmark datasets under different noise levels, we also test CT-Bound on noisy images captured with real-world cameras and generate real-time videos of boundary maps. The contributions of this paper are summarized as follows.

• A two-stage, hybrid neural network architecture;

• A robust, fast, and non-iterative solver of FoJ that enables real-time boundary detection on very noisy images;

• A thorough experimental study that demonstrates CT-Bound achieves the highest or among the highest accuracy in detecting image boundaries from very noisy images compared to the previous best algorithms.

The code, the training data, the testing data, additional results, and the video demonstration of the proposed method can be found at https://github.com/guo-research-group/CT-Bound.

II Related Work

According to the classification by Gong et al. [16], image boundary detection methods can be divided into four categories, i.e., boundaries from luminance changes, texture changes, perceptual grouping, and illusory contours. We focus our literature review on the first category, to which the proposed method belongs.

Local boundary detection

The first step of boundary detection is typically to apply specially designed filters that locate local responses of image boundaries. The filters either maximize the detectability and localization accuracy of the boundaries under noise, such as the Roberts cross operator [14], the Canny detector [6], the Laplacian detectors [17], and the perfect matching filters [18], or are sensitive to the direction of the image boundaries, e.g., Sobel filters [11], Gaussian quadrature pairs [13], and steerable filters [19]. To robustly detect boundaries under non-ideal edges, researchers have also developed more sophisticated filters that are non-linear [20] or operate at multiple scales [21]. However, these classic methods based on local patches are insufficient when the image noise is severe, because the receptive field of a single filter contains too little visual information to determine the edges confidently. In this case, image smoothing is not a solution either, as the smoothing operation makes faint or fine boundary structures indistinguishable [18].

Global boundary refinement

Boundaries in natural images are usually piece-wise smooth. Based on this observation, people have developed algorithms to refine the global boundary map from the locally detected boundaries by regularizing the curvature, for example, the squared curvature [22] or total curvature [23] along the boundary. Another observation for global boundary refinement is that the neighboring boundary maps must agree. There have been methods that use this intuition by enforcing neighboring consistency [9, 24, 12]. These global refinement methods are typically iterative and thus could have a high computational complexity.

Deep learning boundary detection

The emergence of deep learning has enabled the development of deep neural networks that fuse the two steps into an end-to-end architecture learned from data. These methods directly output global boundary maps in nonparametric ways [3, 2, 7, 8, 5, 25, 26] or parametric ways [27, 28], and they typically outperform traditional, non-learning-based edge detection methods. A recent work, Boundary Attention, also combines the FoJ representation with deep neural networks [1]. It targets complicated, fine boundary structures and outputs more in-depth boundary information, such as edge-aware distance maps, from noisy images. Compared with Boundary Attention, CT-Bound focuses on boundary detection and achieves a higher boundary detection accuracy on benchmark datasets while running three times faster. Nonetheless, we encourage readers to also consult [1] for a more comprehensive understanding of this field.

III Methods

Figure 1: Field-of-junction (FoJ) representation [9].

We first briefly describe the FoJ representation [9] that we adopt in this work. Given an image patch $P\in\mathbb{R}^{h\times w\times k}$ with spatial dimension $h\times w$ and $k$ color channels, the FoJ models its boundary structure using a parameter set $\Phi=(\boldsymbol{x},\boldsymbol{\phi},\boldsymbol{c})$ , where $\boldsymbol{x}=(x_{0},y_{0})$ is the location of the vertex, $\boldsymbol{\phi}=(\phi_{1},\cdots,\phi_{l})$ contains the angles of the $l$ edges, and $\boldsymbol{c}=(\boldsymbol{c}_{1},\cdots,\boldsymbol{c}_{l})$ , with $\boldsymbol{c}_{j}\in\mathbb{R}^{k}$ for $j=1,\cdots,l$ , contains the colors of the regions between every pair of neighboring edges. The number of edges $l$ is a hyperparameter that needs to be predetermined. See Fig. 1 for an illustration of FoJ. As shown by Verbin and Zickler, FoJ can represent a variety of local boundary structures, including edges, corners, and junctions [9]. Given an FoJ representation $\Phi$ , the corresponding boundary map of the patch $B(x,y;\Phi)$ can be rendered via:

$$B(x,y;\Phi)=\pi\epsilon H_{2,\epsilon}^{\prime}(\min(d_{j}(x,y))), \tag{1}$$

where $d_{j}(x,y)$ is the distance from the pixel $(x,y)$ to the edge $j$ and $H_{2,\epsilon}^{\prime}$ is the derivative of Heaviside function $H_{2,\epsilon}$ in [29]:

$$H_{2,\epsilon}(d)=\frac{1}{2}\left(1+\frac{2}{\pi}\arctan\frac{d}{\epsilon}\right), \tag{2}$$

where $\epsilon$ is a smoothing parameter, which we set to $\epsilon=0.01$ throughout our experiments. The color map $C(x,y;\Phi)$ can be visualized using:

$$C(x,y;\Phi)=\sum_{j=1}^{l}\delta_{j}(x,y)\boldsymbol{c}_{j}, \tag{3}$$

where $\delta_{j}(x,y)=1$ when the pixel $(x,y)$ lies in the wedge $\Omega_{j}$ between edges $j$ and $j+1$ of the patch $P$ :

$$\begin{aligned}\Omega_{j}=\{&(x,y)\mid(x,y)\in P,\\ &(x-x_{0})\cos\phi_{j}+(y-y_{0})\sin\phi_{j}<0,\\ &(x-x_{0})\cos\phi_{j+1}+(y-y_{0})\sin\phi_{j+1}>0\},\end{aligned} \tag{4}$$

and $\delta_{j}(x,y)=0$ otherwise.
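To make (1) and (2) concrete: differentiating (2) gives $H_{2,\epsilon}^{\prime}(d)=\frac{1}{\pi}\frac{\epsilon}{\epsilon^{2}+d^{2}}$ , so (1) reduces to $B=\epsilon^{2}/(\epsilon^{2}+d^{2})$ , which equals $1$ on an edge and decays quickly away from it. The NumPy sketch below renders such a boundary map for a single patch. It is an illustration only and not the released implementation: the helper names, the normalized pixel coordinates in $[-1,1]$ , and the treatment of each edge as a ray leaving the vertex are all assumptions.

```python
import numpy as np


def ray_distance(xx, yy, x0, y0, phi):
    """Distance from every pixel to the ray that starts at (x0, y0) with angle phi."""
    dx, dy = xx - x0, yy - y0
    along = dx * np.cos(phi) + dy * np.sin(phi)            # projection onto the ray
    perp = np.abs(-dx * np.sin(phi) + dy * np.cos(phi))    # distance to the full line
    # Behind the vertex, the closest point of the ray is the vertex itself.
    return np.where(along >= 0, perp, np.hypot(dx, dy))


def boundary_map(h, w, x0, y0, phis, eps=0.01):
    """Evaluate B = pi * eps * H'(min_j d_j) = eps^2 / (eps^2 + d^2) on an h x w grid."""
    ys, xs = np.mgrid[0:h, 0:w]
    xx = 2.0 * xs / (w - 1) - 1.0   # assumed convention: coordinates span [-1, 1]
    yy = 2.0 * ys / (h - 1) - 1.0
    d = np.min([ray_distance(xx, yy, x0, y0, p) for p in phis], axis=0)
    return eps**2 / (eps**2 + d**2)


# Example: a 21 x 21 patch with the vertex at the center and three edges.
B = boundary_map(21, 21, 0.0, 0.0, [0.0, 2.0, 4.0])
```

Pixels that lie exactly on an edge receive the value $1$ , and with $\epsilon=0.01$ in these normalized units the response falls below $0.01$ within roughly one pixel of a $21\times 21$ patch.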

III-A Network architecture

The network architecture of the proposed method is visualized in Fig. 2. Given a noisy image $I\in\mathbb{R}^{H\times W\times k}$ , CT-Bound first divides the image $I$ into overlapping patches. The initialization stage feeds each image patch $P_{m,n}$ into a CNN to generate the initial vertex location and edge angles of the FoJ representation $(\boldsymbol{x}^{\text{init}}_{m,n},\boldsymbol{\phi}^{\text{init}}_{m,n})$ . Then, the method determines the color parameters $\boldsymbol{c}^{\text{init}}_{m,n}=(\boldsymbol{c}^{\text{init}}_{m,n,1},\cdots,\boldsymbol{c}^{\text{init}}_{m,n,l})$ in closed form by averaging the colors of the pixels in each wedge of the patch:

$$\boldsymbol{c}^{\text{init}}_{m,n,j}=\frac{1}{|\Omega_{m,n,j}|}\sum_{(x,y)\in\Omega_{m,n,j}}P_{m,n}(x,y,:), \tag{5}$$

where $\Omega_{m,n,j}$ indicates the set of pixels in the wedge between edges $j$ and $j+1$ in the patch $P_{m,n}$ , as defined in (4).
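The averaging in (5) can be sketched as follows. The code labels each pixel of a patch with a wedge index and then takes the per-wedge mean color. Assigning wedges by polar angle around the vertex is a simplifying assumption that may differ in detail from the half-plane form of (4); the function names and the $[-1,1]$ coordinate convention are likewise hypothetical.

```python
import numpy as np


def wedge_labels(h, w, x0, y0, phis):
    """Label each pixel with the index of the wedge (by polar angle) it falls in."""
    ys, xs = np.mgrid[0:h, 0:w]
    xx = 2.0 * xs / (w - 1) - 1.0   # assumed convention: coordinates span [-1, 1]
    yy = 2.0 * ys / (h - 1) - 1.0
    theta = np.mod(np.arctan2(yy - y0, xx - x0), 2.0 * np.pi)
    bounds = np.sort(np.mod(phis, 2.0 * np.pi))
    # Angles below the smallest edge angle wrap around into the last wedge.
    return (np.searchsorted(bounds, theta, side="right") - 1) % len(bounds)


def wedge_mean_colors(patch, labels, n_wedges):
    """Mean color of the pixels in each wedge, as in (5); empty wedges stay zero."""
    colors = np.zeros((n_wedges, patch.shape[-1]))
    for j in range(n_wedges):
        mask = labels == j
        if mask.any():
            colors[j] = patch[mask].mean(axis=0)
    return colors


def color_map(labels, colors):
    """Piecewise-constant color map, as in (3): each pixel takes its wedge color."""
    return colors[labels]
```

Because the colors follow deterministically from the vertex and the angles, only the geometric parameters need to be predicted by a network in this formulation.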

The refinement stage takes in the initial FoJ representation $(\boldsymbol{x}^{\text{init}}_{m,n},\boldsymbol{\phi}^{\text{init}}_{m,n},\boldsymbol{c}^{\text{init}}_{m,n})$ of all patches $P_{m,n}$ simultaneously. It first converts each initial FoJ representation into a feature vector $\boldsymbol{v}_{m,n}\in\mathbb{R}^{d}$ , and then applies positional encoding by adding a positional vector $\boldsymbol{p}_{m,n}=[p_{m,n,1},\cdots,p_{m,n,d}]^{T}$ to the feature vector $\boldsymbol{v}_{m,n}$ to incorporate the positional information of each image patch. The 2D positional encoding vector follows the design of Wang and Liu [30]:

$$p_{m,n,i}=\begin{cases}\sin\left(\frac{m}{10000^{4i/d}}\right), & i=0,2,4,\cdots,d/2-2\\ \cos\left(\frac{m}{10000^{4i/d}}\right), & i=1,3,5,\cdots,d/2-1\\ \sin\left(\frac{n}{10000^{4i/d}}\right), & i=d/2,d/2+2,\cdots,d-2\\ \cos\left(\frac{n}{10000^{4i/d}}\right), & i=d/2+1,d/2+3,\cdots,d-1\end{cases}$$

where the dimension $d$ of each feature vector is an even number. All positionally encoded feature vectors are fed into a transformer encoder consisting of a series of multi-head attention layers, which refines the boundary consistency among patches globally and adjusts unnatural boundary estimations. It is the only block in the framework that shares the per-patch FoJ information globally. Finally, the framework outputs the refined vertex locations and edge angles of all patches $(\boldsymbol{x}^{\text{ref}}_{m,n},\boldsymbol{\phi}^{\text{ref}}_{m,n}),\forall m,n$ , and calculates the refined color parameters $\boldsymbol{c}^{\text{ref}}_{m,n}$ using (5). We list the network hyperparameters used in our experiments in Tab. I. As the transformer does not operate in the image domain, the dimension of the input vector and the number of layers are much smaller than those of the classic Vision Transformer [15].
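As an illustration of the positional encoding, the sketch below builds a sinusoidal 2D encoding in which the first half of the channels encodes the patch row index $m$ and the second half encodes the column index $n$ , with sine and cosine channels interleaved. The frequency convention used here (one frequency per sine/cosine pair, indexed by $k\in[0,d/4)$ , which requires $d$ to be divisible by 4) is an assumption, as are the function name and the $7\times 7$ patch grid in the example; only $d=128$ is taken from Tab. I.

```python
import numpy as np


def positional_encoding_2d(n_rows, n_cols, d):
    """Return an (n_rows, n_cols, d) array of sinusoidal 2D positional encodings."""
    assert d % 4 == 0, "this sketch assumes d is divisible by 4"
    k = np.arange(d // 4)                          # one frequency per sin/cos pair
    freq = 1.0 / (10000.0 ** (4.0 * k / d))
    row_arg = np.arange(n_rows)[:, None] * freq    # shape (n_rows, d/4)
    col_arg = np.arange(n_cols)[:, None] * freq    # shape (n_cols, d/4)

    pe = np.zeros((n_rows, n_cols, d))
    half = d // 2
    pe[:, :, 0:half:2] = np.sin(row_arg)[:, None, :]      # even channels, first half
    pe[:, :, 1:half:2] = np.cos(row_arg)[:, None, :]      # odd channels, first half
    pe[:, :, half::2] = np.sin(col_arg)[None, :, :]       # even channels, second half
    pe[:, :, half + 1::2] = np.cos(col_arg)[None, :, :]   # odd channels, second half
    return pe


# Example: add the encoding to hypothetical per-patch feature vectors.
features = np.zeros((7, 7, 128))
tokens = (features + positional_encoding_2d(7, 7, 128)).reshape(-1, 128)
```

The flattened `tokens` array has one row per patch, which is the sequence layout a transformer encoder typically expects.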

TABLE I: CT-Bound model hyperparameters.

Convolutional Neural Network
Layer	Specification	Output
Conv2d	5 $\times$ 5 kernel, 4 stride, 2 pad	(21, 21, 96)
MaxPool2d	3 $\times$ 3 kernel, 2 stride, 0 pad	(10, 10, 96)
Conv2d	5 $\times$ 5 kernel, 1 stride, 2 pad	(10, 10, 256)
MaxPool2d	2 $\times$ 2 kernel, 2 stride, 0 pad	(5, 5, 256)
Conv2d	3 $\times$ 3 kernel, 1 stride, 1 pad	(5, 5, 384)
Conv2d	3 $\times$ 3 kernel, 1 stride, 1 pad	(5, 5, 384)
Conv2d	3 $\times$ 3 kernel, 1 stride, 1 pad	(5, 5, 256)
MaxPool2d	3 $\times$ 3 kernel, 2 stride, 0 pad	(2, 2, 256)
FC	-	4096
FC	-	1024
FC	-	5
Transformer Encoder
Specification		Parameter
Dimension of each input vector		128
Number of layers		8
Number of heads in each layer		8
Dimension of the feed-forward layer		256
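The output sizes listed in Tab. I can be checked with the standard output-size formula for convolution and pooling layers, $\lfloor(n+2p-k)/s\rfloor+1$ . The sketch below traces the spatial size through the listed layers. The input side length is not given in the table; the value 84 used here is a hypothetical choice made only because it reproduces the listed $21\times 21$ output of the first layer (any value from 81 to 84 does).

```python
def conv_out(n, kernel, stride, pad):
    """Output side length of a convolution or pooling layer (floor mode)."""
    return (n + 2 * pad - kernel) // stride + 1


# (name, kernel, stride, pad, output channels) in the order listed in Tab. I.
LAYERS = [
    ("Conv2d", 5, 4, 2, 96),
    ("MaxPool2d", 3, 2, 0, 96),
    ("Conv2d", 5, 1, 2, 256),
    ("MaxPool2d", 2, 2, 0, 256),
    ("Conv2d", 3, 1, 1, 384),
    ("Conv2d", 3, 1, 1, 384),
    ("Conv2d", 3, 1, 1, 256),
    ("MaxPool2d", 3, 2, 0, 256),
]


def trace_shapes(side, layers=LAYERS):
    """Return the (side, side, channels) shape after every layer."""
    shapes = []
    for _, kernel, stride, pad, channels in layers:
        side = conv_out(side, kernel, stride, pad)
        shapes.append((side, side, channels))
    return shapes


shapes = trace_shapes(84)  # 84 is an assumed input side length, see above
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
```

Under this assumption the final feature map is $(2,2,256)$ , i.e., 1024 values entering the fully connected layers. The 5 outputs of the last layer are consistent with two vertex coordinates plus $l=3$ edge angles.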

From the refined FoJ parameters $\Phi^{\text{ref}}_{m,n}=(\boldsymbol{x}^{\text{ref}}_{m,n},\boldsymbol{\phi}^{\text{ref}}_{m,n},\boldsymbol{c}^{\text{ref}}_{m,n})$ , CT-Bound generates the per-patch boundary and color maps according to (1) and (3), and computes the global boundary map $B(x,y)$ by averaging the per-patch boundary maps [9]:

$$B(x,y)=\frac{1}{\left|N(x,y)\right|}\sum_{N(x,y)}B(x,y;\Phi^{\text{ref}}_{m,n}), \tag{6}$$

and the global color map $C(x,y)$ via a specific smoothing operation over the per-patch color maps:

$$C(x,y)=\frac{1}{\left|N(x,y)\right|}\sum_{N(x,y)}\delta_{m,n,j}(x,y)\boldsymbol{c}^{\text{ref}}_{m,n,j}, \tag{7}$$

where $N(x,y)=\left\{(m,n)|(x,y)\in\Omega_{m,n,j}\right\}$ is the set of the patch indices that contain $(x,y)$ , $\delta_{m,n,j}$ is a binary indicator that is $1$ if pixel $(x,y)$ belongs to the wedge $\Omega_{m,n,j}$ and $0$ otherwise, and $\boldsymbol{c}^{\text{ref}}_{m,n,j}$ is the refined color of wedge $\Omega_{m,n,j}$ .
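The averaging in (6) amounts to accumulating every per-patch map into an image-sized buffer and dividing by the number of patches that cover each pixel. A minimal NumPy sketch is given below. The patch layout (top-left corner of patch $(m,n)$ at row `m * stride`, column `n * stride`) and the stride value are assumptions, since neither is specified in the text, and the function name is hypothetical.

```python
import numpy as np


def aggregate_patches(patch_maps, stride, height, width):
    """Average overlapping per-patch maps into one global map.

    patch_maps has shape (M, N, h, w); patch (m, n) is assumed to cover rows
    m*stride .. m*stride+h-1 and columns n*stride .. n*stride+w-1.
    """
    n_rows, n_cols, h, w = patch_maps.shape
    total = np.zeros((height, width))
    count = np.zeros((height, width))
    for m in range(n_rows):
        for n in range(n_cols):
            r, c = m * stride, n * stride
            total[r:r + h, c:c + w] += patch_maps[m, n]
            count[r:r + h, c:c + w] += 1
    # count plays the role of |N(x, y)|; guard against uncovered pixels.
    return total / np.maximum(count, 1)


# Example: nine 5 x 5 patches with stride 2 tile a 9 x 9 image.
patches = np.arange(9, dtype=float).reshape(3, 3, 1, 1) * np.ones((3, 3, 5, 5))
global_map = aggregate_patches(patches, stride=2, height=9, width=9)
```

The same routine applies to the color map in (7) if it is called once per color channel on the per-patch color maps.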

III-B Loss functions

We develop a multi-stage training scheme to optimize the parameters of CT-Bound. First, we train the initialization stage using the patch reconstruction loss:

$$\mathcal{L}_{\text{init}}=\mathbb{E}_{P}\left(\mathrm{MSE}\left(C\left(x,y;\Phi^{\text{gt}}\right),C\left(x,y;\Phi^{\text{init}}\right)\right)\right), \tag{8}$$

where $\mathbb{E}_{P}$ denotes the expectation over all patches in the training set, and $C(x,y;\Phi^{\text{gt}})$ and $C(x,y;\Phi^{\text{init}})$ indicate the per-patch color maps reconstructed using the true and estimated FoJ parameters, respectively. We observe that the visual quality of the FoJ estimation is higher when using the loss in (8) for training than when directly supervising the FoJ parameters. Furthermore, because the CNN in the initialization stage has a small receptive field, we can use synthetic image patches of basic shapes to train it, and we observe that the trained model generalizes to real-world image patches without further fine-tuning.

When optimizing the parameters of the refinement stage, we use a fixed, pre-trained initialization stage to generate the inputs $\Phi^{\text{init}}$ . We adopt a two-step training process for the refinement stage, which we observed to give more stable and faster convergence. In the first step, a mean squared error loss function is used to supervise the estimated FoJ parameters directly:

$$\mathcal{L}_{\text{ref1}}=\mathbb{E}_{P}\left(\lVert\boldsymbol{x}^{\text{gt}}-\boldsymbol{x}^{\text{ref}}\rVert^{2}+\lVert\boldsymbol{\phi}^{\text{gt}}-\boldsymbol{\phi}^{\text{ref}}\rVert^{2}\right). \tag{9}$$

In the second step, we use a comprehensive image reconstruction loss adapted from Verbin and Zickler [9]:

$$\mathcal{L}_{\text{ref2}}=\mathbb{E}_{I}\left(l_{p}+\lambda_{b}l_{b}+\lambda_{c}l_{c}\right), \tag{10}$$

where $\mathbb{E}_{I}$ indicates the expectation over all images in the dataset, and $l_{p}$ , $l_{b}$ , and $l_{c}$ are patch, boundary, and color loss terms, respectively:

$$\begin{aligned}l_{p}&=\sum_{m,n}\sum_{j=1}^{l}\sum_{x,y}\delta_{m,n,j}(x,y)\lVert\boldsymbol{c}^{\text{ref}}_{m,n,j}-I(x,y)\rVert^{2},\\ l_{b}&=\sum_{m,n}\sum_{x,y}\left(B(x,y)-B(x,y;\Phi^{\text{ref}}_{m,n})\right)^{2},\\ l_{c}&=\sum_{m,n}\sum_{j=1}^{l}\sum_{x,y}\delta_{m,n,j}(x,y)\lVert\boldsymbol{c}^{\text{ref}}_{m,n,j}-C(x,y)\rVert^{2}.\end{aligned}$$

In Verbin and Zickler, the loss in (10) was solved in an alternating, two-step fashion to refine the FoJ representation iteratively [9]. We evaluate the loss in a single step and show it can successfully fine-tune the feed-forward transformer encoder to improve the FoJ representation.
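For a single image, the three terms of (10) can be evaluated from rendered maps as sketched below. Since exactly one indicator $\delta_{m,n,j}$ is active at a pixel inside a wedge, the sums over $j$ in $l_{p}$ and $l_{c}$ collapse to a comparison against the per-patch color map of (3). The sketch uses NumPy for readability and is therefore not differentiable as written; the patch layout, the stride, the function names, and the values of $\lambda_{b}$ and $\lambda_{c}$ (which are not given in the text) are assumptions.

```python
import numpy as np


def crop(img, m, n, h, w, stride):
    """Region of a global map covered by patch (m, n) under the assumed layout."""
    r, c = m * stride, n * stride
    return img[r:r + h, c:c + w]


def refinement_loss(image, patch_colors, patch_bounds,
                    global_color, global_bound, stride, lam_b, lam_c):
    """l_p + lam_b * l_b + lam_c * l_c for one image.

    patch_colors: (M, N, h, w, k) per-patch color maps, as in (3)
    patch_bounds: (M, N, h, w)    per-patch boundary maps, as in (1)
    """
    n_rows, n_cols, h, w, _ = patch_colors.shape
    l_p = l_b = l_c = 0.0
    for m in range(n_rows):
        for n in range(n_cols):
            c_mn, b_mn = patch_colors[m, n], patch_bounds[m, n]
            # Patch term: per-patch color map against the (noisy) input image.
            l_p += np.sum((c_mn - crop(image, m, n, h, w, stride)) ** 2)
            # Boundary consistency: per-patch map against the global average.
            l_b += np.sum((crop(global_bound, m, n, h, w, stride) - b_mn) ** 2)
            # Color consistency: per-patch colors against the global color map.
            l_c += np.sum((c_mn - crop(global_color, m, n, h, w, stride)) ** 2)
    return l_p + lam_b * l_b + lam_c * l_c
```

The two consistency terms vanish only when every patch agrees with the averaged global maps on its support, which is what encourages neighboring patches to agree with each other.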

IV Experimental Results

IV-A Data processing

For the training of the initialization stage, we use randomly sampled patches from FoJ synthetic datasets [9] that only contain images of basic shapes such as squares. We select $8000$ image patches for training and $2000$ for testing. To simulate image noise, we apply a Poisson-Gaussian process to each image patch [31]:

$$P(x,y)=\mathrm{Poisson}\left(\alpha P^{*}(x,y)\right)+\mathrm{Gaussian}\left(0,\sigma^{2}\right), \tag{11}$$

where $P(x,y)$ and $P^{*}(x,y)\in[0,1]$ are the noisy and normalized clean image patches, $\alpha$ is the photon level parameter that controls the noise of the image, and $\sigma$ is the standard deviation of the read noise ( $\sigma=2$ is applied as in [32]). For the refinement stage, we use images from MS COCO [33] for training and testing. The training and testing sets contain $1600$ and $400$ randomly selected, non-overlapping images, respectively. Each image is cropped at the center to the size of $147\times 147$ and corrupted with the same noise model as described in (11). In our experiment, we randomly set the photon level $\alpha$ within the range $[2,10]$ to generate images with a variety of noise levels.
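The noise model in (11) can be simulated in a few lines with NumPy's random generator. The sketch below is one possible implementation under stated assumptions: the function name is hypothetical, the output is left in photon-count units (no renormalization is applied, since the text does not describe one), and the seed in the example is arbitrary.

```python
import numpy as np


def add_poisson_gaussian_noise(clean, alpha, sigma=2.0, rng=None):
    """Simulate (11): Poisson(alpha * clean) plus zero-mean Gaussian read noise.

    clean is expected in [0, 1]; alpha is the photon level; sigma is the
    standard deviation of the read noise.
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.clip(np.asarray(clean, dtype=np.float64), 0.0, 1.0)
    photons = rng.poisson(alpha * clean).astype(np.float64)   # shot noise
    read_noise = rng.normal(0.0, sigma, size=clean.shape)     # read noise
    return photons + read_noise


# Example: a uniform gray patch at a randomly drawn photon level in [2, 10].
rng = np.random.default_rng(0)
alpha = rng.uniform(2.0, 10.0)
noisy = add_poisson_gaussian_noise(np.full((21, 21, 3), 0.5), alpha, rng=rng)
```

For a pixel with clean value $P^{*}$ , the simulated value has mean $\alpha P^{*}$ and variance $\alpha P^{*}+\sigma^{2}$ , so at $\alpha=2$ and $\sigma=2$ the noise standard deviation exceeds the largest possible signal.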

TABLE II: ODS F1-score of CT-Bound before and after the refinement stage.

Photon level $\alpha_{\text{test}}$	BSDS500 [12] (before)	BSDS500 [12] (after)	NYUDv2 [34] (before)	NYUDv2 [34] (after)
2	0.482	0.541	0.479	0.552
4	0.509	0.627	0.522	0.633
6	0.518	0.640	0.538	0.646
8	0.524	0.633	0.546	0.647

We evaluate CT-Bound on the testing sets of the Berkeley Segmentation Data Set 500 (BSDS500) [12] and the NYU Depth Dataset V2 (NYUDv2) [34]. We crop images to $147\times 147$ and add noise as above. BSDS500 has 200 testing images. For NYUDv2, 200 images are randomly selected from the testing split adopted in [35]. Using different datasets for evaluation demonstrates the generalizability of our model.

IV-B Implementation details

We use $l=3$ in our implementation. All optimizations in this work use the Adam optimizer [36]. The initialization stage is trained with an initial learning rate of $0.0002$ and a decay of $0.5$ every $80$ epochs. The batch size is $32$ , and the total number of training epochs is $900$ . We use a two-step scheme to train the refinement stage as described in Sec. III-B. Both steps use a batch size of $16$ . The first step uses (9) as the objective function and runs for $100$ epochs with a learning rate of $5\times 10^{-5}$ . The second step switches to (10) as its loss function and runs for $1600$ epochs. The learning rate for the second step is updated with a triangular cycle between $1.75\times 10^{-4}$ and $3.5\times 10^{-4}$ . The training and testing are performed on a machine with an NVIDIA GeForce RTX A5000 graphics card and 24 GB memory.
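The two learning-rate schedules mentioned above can be written as plain functions, as sketched below. Two points are assumptions rather than details from the text: "a decay of $0.5$ every $80$ epochs" is read as multiplying the rate by $0.5$ at each 80-epoch boundary, and the length of the triangular cycle (`half_cycle`, counted here in schedule steps) is a hypothetical value because it is not reported.

```python
def step_decay_lr(epoch, base_lr=2e-4, gamma=0.5, step=80):
    """Learning rate multiplied by gamma once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


def triangular_lr(it, lo=1.75e-4, hi=3.5e-4, half_cycle=100):
    """Triangular cyclic schedule: linear ramp lo -> hi -> lo, then repeat.

    half_cycle is the number of steps of the rising ramp (hypothetical value).
    """
    pos = it % (2 * half_cycle)
    frac = pos / half_cycle if pos <= half_cycle else 2.0 - pos / half_cycle
    return lo + (hi - lo) * frac


# Example: learning rate of the initialization stage at a few epochs.
init_schedule = [step_decay_lr(e) for e in (0, 79, 80, 160, 899)]
```

Under this reading, the initialization-stage rate is halved eleven times over the 900 epochs, ending near $1\times 10^{-7}$ .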

The fixed contour threshold (ODS) F1-score is recorded during evaluation, with non-maximum suppression [6] applied in advance. We adjust the localization tolerance proportionally based on [5] to accommodate the image size in our experiment, setting it to 0.0209 for BSDS500 and 0.0372 for NYUDv2.
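The sketch below illustrates both quantities. The first function rescales a tolerance that is expressed as a fraction of the image diagonal so that the tolerance in pixels stays the same after cropping. The base tolerances (0.0075 and 0.011) and the original image sizes ( $321\times 481$ and $425\times 560$ ) are assumptions: they are values commonly used for these benchmarks, are not stated in the text, and are included only because they reproduce the two reported numbers. The second function computes an ODS F1-score from per-image match counts. It is a simplification, since the standard benchmark also performs tolerance-based matching between predicted and annotated boundary pixels, which is omitted here.

```python
import math


def scaled_tolerance(base_tol, orig_hw, new_hw):
    """Rescale a diagonal-relative tolerance so the pixel tolerance is unchanged."""
    return base_tol * math.hypot(*orig_hw) / math.hypot(*new_hw)


def ods_f1(counts_per_threshold):
    """ODS F1: pool the counts over the whole dataset, then take the best threshold.

    counts_per_threshold[t] is a list with one (tp, fp, fn) tuple per image,
    obtained by binarizing every boundary map with the same threshold t.
    """
    best = 0.0
    for per_image in counts_per_threshold:
        tp = sum(c[0] for c in per_image)
        fp = sum(c[1] for c in per_image)
        fn = sum(c[2] for c in per_image)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0   # equals 2PR / (P + R)
        best = max(best, f1)
    return best


tol_bsds = scaled_tolerance(0.0075, (321, 481), (147, 147))   # assumed inputs
tol_nyud = scaled_tolerance(0.011, (425, 560), (147, 147))    # assumed inputs
```

Pooling the counts before choosing the threshold is what distinguishes the dataset-level (ODS) score from a score that picks the best threshold separately for every image.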

IV-C Ablation study

TABLE III: Quantitative comparison of boundary detection on noisy images synthesized from BSDS500 [12] and NYUDv2 [34] datasets. The numbers are ODS F1-scores. The proposed method and Boundary Attention [1] demonstrate the highest or the second highest boundary detection accuracy on both datasets when the image is very noisy, i.e., $\alpha_{\text{test}}=2$ . Meanwhile, ours is three times faster than Boundary Attention [1]. When the noise level becomes lower, i.e., $\alpha_{\text{test}}$ being larger, Restormer [4] $\rightarrow$ HED [5] starts to perform better. This result suggests the robustness of our method in detecting boundaries at extremely high noise levels.

FPS is measured with: ^§ OpenCV-CPU, ^† PyTorch-GPU, ^‡ JAX-GPU.

Model	Publication’Year	BSDS500 [12], $\alpha_{\text{test}}=2$	4	6	8	NYUDv2 [34], $\alpha_{\text{test}}=2$	4	6	8	FPS
Canny [6]	PAMI’86	0.493	0.489	0.383	0.495	0.476	0.467	0.325	0.482	625^§
HED [5]	ICCV’15	0.327	0.388	0.456	0.520	0.307	0.363	0.434	0.484	18.8^§
FoJ [9]	ICCV’21	0.509	0.564	0.597	0.611	0.529	0.576	0.597	0.613	1/68^†
PiDiNet [8]	ICCV’21	0.316	0.356	0.447	0.480	0.274	0.311	0.409	0.456	125^†
EDTER [7]	CVPR’22	0.484	0.509	0.545	0.594	0.488	0.478	0.534	0.581	15.6^†
Restormer [4] $\rightarrow$ Canny [6]	CVPR’22+PAMI’86	0.490	0.521	0.518	0.516	0.511	0.533	0.503	0.499	9.6^†
Restormer [4] $\rightarrow$ HED [5]	CVPR’22+ICCV’15	0.459	0.628	0.674	0.707	0.474	0.625	0.647	0.663	6.4^†
UAED [3]	CVPR’23	0.473	0.553	0.618	0.665	0.452	0.557	0.620	0.652	28.7^†
PEdger [2]	ACMMM’23	0.525	0.611	0.657	0.684	0.509	0.606	0.632	0.650	14.4^†
Boundary Attention [1]	arXiv’24	0.534	0.591	0.607	0.615	0.572	0.609	0.624	0.625	4.3^‡
Ours	-	0.541	0.627	0.640	0.633	0.552	0.633	0.646	0.647	11.4^†, 15.2^‡

We analyze the benefit offered by the refinement stage of CT-Bound. As shown in Fig. 3, the refinement stage attenuates noisy and inconsistent boundary estimations and strengthens real boundaries compared to the boundary map from the initialization stage. It also makes the color map appear smoother and sharper at color boundaries. The quantitative analysis is shown in Tab. II. It draws the same conclusion: the refinement stage increases the ODS F1-score of the boundary map compared to the initialization stage.

IV-D Analysis on synthetic and real images

We compare the proposed method with the iterative FoJ solver [9], the traditional edge detector Canny [6], and other learning-based models, including Boundary Attention [1], PEdger [2], UAED [3], Restormer [4] $\rightarrow$ HED [5], Restormer [4] $\rightarrow$ Canny [6], EDTER [7], PiDiNet [8], and HED [5]. Tab. III shows the quantitative comparison of these methods. Note that our model is trained using images from the MS COCO dataset [33] with a random photon level $\alpha_{\text{train}}\in[2,10]$ and evaluated using images from the BSDS500 [12] and NYUDv2 [34] datasets at a specific photon level $\alpha_{\text{test}}$ . The proposed approach achieves the highest or near-highest ODS F1-score when the noise level is very high, i.e., $\alpha_{\text{test}}=2$ . Fig. 4 shows sample boundary maps estimated from noisy images synthesized from the BSDS500 [12] and NYUDv2 [34] datasets. Both results indicate the robustness and generalizability of the proposed method to different image datasets and noise levels.

Fig. 5 shows the boundary maps and color maps estimated from a real image captured by an iPhone 13 Mini camera with a high shutter speed. The boundary map and color map generated from the proposed method have high visual quality without any fine-tuning to the real images. We also upload a video of boundary maps and color maps of a real captured video clip generated by CT-Bound to the URL listed in Sec. I.

V Conclusion and Limitation

In this paper, we propose a two-stage boundary detector, CT-Bound, which is a hybrid neural network architecture aiming to achieve robust and accurate boundary detection on extremely noisy images in a single shot. Compared to a variety of models, our method demonstrates the highest or near-highest boundary detection accuracy on benchmark datasets, producing visually clean and crisp boundaries. A limitation we observe is that, when processing videos, the detected boundaries can change abruptly across frames. This is because CT-Bound processes each frame independently. The problem could be mitigated by introducing temporal consistency into the model.

References

[1] Mia Gaia Polansky, Charles Herrmann, Junhwa Hur, Deqing Sun, Dor Verbin, and Todd Zickler, “Boundary attention: Learning to find faint boundaries at any resolution,” arXiv preprint arXiv:2401.00935, 2024.
[2] Yuanbin Fu and Xiaojie Guo, “Practical edge detection via robust collaborative learning,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2526–2534.
[3] Caixia Zhou, Yaping Huang, Mengyang Pu, Qingji Guan, Li Huang, and Haibin Ling, “The treasure beneath multiple annotations: An uncertainty-aware edge detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15507–15517.
[4] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5728–5739.
[5] Saining Xie and Zhuowen Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1395–1403.
[6] John Canny, “A computational approach to edge detection,” IEEE Transactions on pattern analysis and machine intelligence, vol. PAMI-8, no. 6, pp. 679–698, 1986.
[7] Mengyang Pu, Yaping Huang, Yuming Liu, Qingji Guan, and Haibin Ling, “Edter: Edge detection with transformer,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1402–1412.
[8] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu, “Pixel difference networks for efficient edge detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 5117–5127.
[9] Dor Verbin and Todd Zickler, “Field of junctions: Extracting boundary structure at low snr,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6869–6878.
[10] Rui Sun, Tao Lei, Qi Chen, Zexuan Wang, Xiaogang Du, Weiqiang Zhao, and Asoke K Nandi, “Survey of image edge detection,” Frontiers in Signal Processing, vol. 2, pp. 826967, 2022.
[11] FG Irwin et al., “An isotropic 3x3 image gradient operator,” Presentation at Stanford AI Project, vol. 1968, pp. 3, 2014.
[12] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik, “Contour detection and hierarchical image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 5, pp. 898–916, 2010.
[13] Jitendra Malik, Serge Belongie, Thomas Leung, and Jianbo Shi, “Contour and texture analysis for image segmentation,” International journal of computer vision, vol. 43, pp. 7–27, 2001.
[14] L Roberts, “Machine perception of 3-d solids, optical and electro-optical information processing,” 1965.
[15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[16] Xin-Yi Gong, Hu Su, De Xu, Zheng-Tao Zhang, Fei Shen, and Hua-Bin Yang, “An overview of contour detection approaches,” International Journal of Automation and Computing, vol. 15, pp. 656–672, 2018.
[17] Xin Wang, “Laplacian operator-based edge detectors,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 5, pp. 886–890, 2007.
[18] Nati Ofir, Meirav Galun, Sharon Alpert, Achi Brandt, Boaz Nadler, and Ronen Basri, “On detection of faint edges in noisy images,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 4, pp. 894–908, 2019.
[19] William T Freeman, Edward H Adelson, et al., “The design and use of steerable filters,” IEEE Transactions on Pattern analysis and machine intelligence, vol. 13, no. 9, pp. 891–906, 1991.
[20] Pietro Perona, Jitendra Malik, et al., “Detecting and localizing edges composed of steps, peaks and roofs,” 1991.
[21] Y. Lu and R.C. Jain, “Reasoning about edges in scale space,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 4, pp. 450–468, 1992.
[22] Claudia Nieuwenhuis, Eno Toeppe, Lena Gorelick, Olga Veksler, and Yuri Boykov, “Efficient squared curvature,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4098–4105.
[23] Qiuxiang Zhong, Yutong Li, Yijie Yang, and Yuping Duan, “Minimizing discrete total curvature for image processing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9474–9482.
[24] Nati Ofir, Meirav Galun, Boaz Nadler, and Ronen Basri, “Fast detection of curved edges at low snr,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 213–221.
[25] Xavier Soria Poma, Edgar Riba, and Angel Sappa, “Dense extreme inception network: Towards a robust cnn model for edge detection,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 1923–1932.
[26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
[27] Kun Huang, Yifan Wang, Zihan Zhou, Tianjiao Ding, Shenghua Gao, and Yi Ma, “Learning to parse wireframes in images of man-made environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 626–635.
[28] Nan Xue, Song Bai, Fudong Wang, Gui-Song Xia, Tianfu Wu, and Liangpei Zhang, “Learning attraction field representation for robust line segment detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1595–1603.
[29] Tony F Chan and Luminita A Vese, “Active contours without edges,” IEEE Transactions on image processing, vol. 10, no. 2, pp. 266–277, 2001.
[30] Zelun Wang and Jyh-Charn Liu, “Translating math formula images to latex sequences using deep neural networks with sequence-level training,” International Journal on Document Analysis and Recognition (IJDAR), vol. 24, no. 1-2, pp. 63–75, 2021.
[31] Qiaoqiao Ding, Yong Long, Xiaoqun Zhang, and Jeffrey A Fessler, “Modeling mixed poisson-gaussian noise in statistical image reconstruction for x-ray ct,” Arbor, vol. 1001, pp. 48109, 2016.
[32] Stanley H Chan, “What does a one-bit quanta image sensor offer?,” IEEE Transactions on Computational Imaging, vol. 8, pp. 770–783, 2022.
[33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
[34] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus, “Indoor segmentation and support inference from rgbd images,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–760.
[35] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik, “Perceptual organization and recognition of indoor scenes from rgb-d images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 564–571.
[36] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

Supplemental Materials for CT-Bound: Robust Boundary Detection From Noisy Images Via Hybrid Convolution and Transformer Neural Networks

In this supplementary document, we first show samples of our training data in Sec. S1. We then report a quantitative comparison of color maps on the BSDS500 [12] and NYUDv2 [34] datasets in Sec. S2. Finally, we present additional qualitative results on the BSDS500 [12] and NYUDv2 [34] datasets in Sec. S3.

S1 Training Data

We use different datasets for the initialization and refinement stages. The initialization stage only needs patches as inputs. We generate its noisy patches from patches randomly sampled from the FoJ synthetic datasets [9], which contain basic shapes in grayscale: we assign RGB colors randomly and add noise according to (11). For the refinement stage, we use noisy images generated from MS COCO [33] with the same noise model. The ground truth boundary maps are generated by running FoJ [9] on the ground truth color maps. Some training pair samples are shown in Fig. S1.
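The random color assignment for the initialization-stage patches can be illustrated with a minimal sketch. It assumes that a synthetic grayscale patch is piecewise constant, so that each distinct gray level marks one region; the function name `assign_random_colors` and this region convention are illustrative assumptions, not the authors' code. The noise step of (11) is deliberately left out here.

```python
import numpy as np


def assign_random_colors(gray_patch, rng):
    """Map each distinct gray level in a patch to one random RGB color.

    Assumption (not from the paper): the patch is piecewise constant, so
    pixels that share a gray level belong to the same region.
    """
    gray_patch = np.asarray(gray_patch)
    # `inverse` gives, for every pixel, the index of its gray level.
    levels, inverse = np.unique(gray_patch, return_inverse=True)
    # One random color in [0, 1) per gray level.
    colors = rng.random((levels.size, 3))
    # Look up the color of every pixel and restore the spatial layout.
    return colors[inverse.reshape(-1)].reshape(gray_patch.shape + (3,))


# Hypothetical 2x3 patch with two regions (gray levels 0 and 1).
rng = np.random.default_rng(0)
patch = np.array([[0, 0, 1],
                  [0, 1, 1]])
color_patch = assign_random_colors(patch, rng)
print(color_patch.shape)
```

Pixels that share a gray level receive the same color, so region boundaries in the grayscale patch are preserved in the RGB patch.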

S2 Quantitative Results of Color Maps

Since different photon levels cause different distribution parameters of pixel values, we normalize the pixel values before calculating the metrics for color maps. Specifically, a color map is normalized through:

$$P^{\prime}(x,y)=\frac{P(x,y)}{\alpha}, \tag{S1}$$

where $P(x,y)$ and $\alpha$ are from (11). The structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and mean squared error (MSE) are then calculated between $P^{\prime}(x,y)$ and $P^{*}(x,y)$. The quantitative comparison is shown in Tab. S1.
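The normalization in (S1) followed by the MSE and PSNR computation can be sketched as below. This is a minimal illustration, not the authors' evaluation script: the function names are hypothetical, and the PSNR peak value `data_range=1.0` assumes that $P^{*}(x,y)$ lies in $[0,1]$. SSIM is omitted; a library routine such as scikit-image's `structural_similarity` could be applied to the same normalized pair.

```python
import numpy as np


def normalize_color_map(p, alpha):
    """Eq. (S1): divide the color map P by the photon level alpha."""
    return np.asarray(p, dtype=np.float64) / alpha


def mse(a, b):
    """Mean squared error over all pixels and channels."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))


def psnr(a, b, data_range=1.0):
    """PSNR in dB; data_range=1.0 is an assumed peak value."""
    err = mse(a, b)
    if err == 0.0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / err))


# Toy data (hypothetical): a reference map in [0, 1) and a color map
# that lives at the photon scale alpha, with a small perturbation.
rng = np.random.default_rng(0)
alpha = 4.0
p_star = rng.random((8, 8, 3))
p = alpha * p_star + rng.normal(0.0, 0.2, p_star.shape)

p_norm = normalize_color_map(p, alpha)
print(mse(p_norm, p_star), psnr(p_norm, p_star))
```

Dividing by $\alpha$ first puts maps from different photon levels on a common scale, so the metrics are comparable across the rows of Tab. S1.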

TABLE S1: Quantitative comparison of color maps on noisy images synthesized from BSDS500 [12] and NYUDv2 [34] datasets.

Photon level $\alpha_{\text{test}}$	Model	BSDS500 SSIM $\uparrow$	BSDS500 PSNR (dB) $\uparrow$	BSDS500 MSE ($\times 10^{-2}$) $\downarrow$	NYUDv2 SSIM $\uparrow$	NYUDv2 PSNR (dB) $\uparrow$	NYUDv2 MSE ($\times 10^{-2}$) $\downarrow$
2	FoJ [9]	0.338	10.712	8.626	0.433	10.657	8.954
	Restormer [4]	0.152	8.492	14.624	0.157	8.567	14.312
	Boundary Attention [1]	0.301	10.952	8.982	0.355	10.791	9.284
	Ours	0.332	10.724	8.606	0.467	10.709	8.870
4	FoJ [9]	0.441	17.212	2.012	0.572	17.619	1.913
	Restormer [4]	0.268	14.918	3.431	0.222	14.920	3.478
	Boundary Attention [1]	0.452	17.785	2.031	0.560	18.103	1.926
	Ours	0.429	17.198	2.024	0.587	17.740	1.885
6	FoJ [9]	0.484	19.780	1.176	0.632	20.799	0.951
	Restormer [4]	0.357	17.496	2.042	0.289	17.800	1.903
	Boundary Attention [1]	0.510	20.542	1.130	0.641	21.633	0.908
	Ours	0.468	19.688	1.209	0.637	20.918	0.947
8	FoJ [9]	0.507	20.900	0.951	0.667	22.427	0.669
	Restormer [4]	0.430	19.258	1.406	0.353	19.706	1.289
	Boundary Attention [1]	0.541	21.768	0.872	0.684	23.500	0.604
	Ours	0.488	20.726	0.996	0.666	22.473	0.679

S3 Additional Qualitative Results

In this section, we show additional qualitative comparisons, including color maps, at various photon levels $\alpha_{\text{test}}$ in Fig. S2 and Fig. S3. The proposed method produces crisper boundaries than some other models, even ones that achieve higher ODS F1-scores in Tab. III.