CT-Bound: Robust Boundary Detection From Noisy Images Via Hybrid Convolution and Transformer Neural Networks

Wei Xu*, Junjie Luo, and Qi Guo
* Corresponding author
Elmore Family School of Electrical and Computer Engineering
Purdue University, West Lafayette, IN, USA
{xu1639,luo330,qiguo}@purdue.edu
Abstract

We present CT-Bound, a robust and fast boundary detection method for very noisy images using a hybrid Convolution and Transformer neural network. The proposed architecture decomposes boundary estimation into two tasks: local detection and global regularization. During local detection, the model uses a convolutional architecture to predict the boundary structure of each image patch in the form of a pre-defined local boundary representation, the field-of-junctions (FoJ) [9]. Then, it uses a feed-forward transformer architecture to globally refine the boundary structures of all patches and to generate an edge map and a smoothed color map simultaneously. Our quantitative analysis shows that CT-Bound outperforms the previous best algorithms in edge detection on very noisy images. It also increases the edge detection accuracy of FoJ-based methods while running three times faster. Finally, we demonstrate that CT-Bound can produce boundary and color maps on real captured images without extra fine-tuning, and can generate real-time boundary map and color map videos at ten frames per second.

Index Terms:
boundary estimation, image denoising, convolutional neural network, transformer
Figure: Boundary detection from noisy images. Compared to a variety of models [1, 2, 3, 4, 5, 6, 7, 8, 9], ours robustly detects the boundaries even when they are visually challenging to discriminate.

I Introduction

Detecting boundary structures from very noisy images is a common and challenging computer vision problem [10]. Many applications require boundary detection from images captured at very low light levels, such as medical imaging, manufacturing, and autonomous navigation. Although image boundary detection has been broadly studied since the early stages of computer vision [11, 12, 13, 6, 14], our results show that the accuracy of the current best boundary detection algorithms is still unsatisfactory when the input images have a very low light level (see the figure preceding this section).

We present CT-Bound, a deep neural network architecture that can robustly detect boundaries from a single noisy image. The model processes the input image to predict, for each image patch, a generalized local boundary representation called the field-of-junctions (FoJ) [9]. FoJ can represent a variety of boundary types in image patches, including edges, corners, and contours, and is an effective prior in edge detection, especially for noisy images [9, 1]. By constraining the predicted boundary structures to those that FoJ can describe, we observe that our model can detect very faint edge signals in the presence of significant noise (see the figure preceding this section). Our experiments show that CT-Bound achieves the highest boundary detection accuracy among a variety of recent edge detection methods.

CT-Bound consists of an innovative two-stage, hybrid Convolution and Transformer neural network architecture (Fig. 2). The first stage is a convolutional architecture that makes an initial prediction of a local FoJ parameterization solely based on the visual appearance of each image patch. The second stage consists of a feed-forward transformer encoder that takes in the initial FoJ estimations of all image patches to perform refinement. The architecture is novel as it completely decomposes boundary estimation into two tasks: detecting boundaries from local image patches, and regularizing neighboring boundary estimations so that they are consistent and resemble natural boundaries. The convolutional network stage conducts boundary detection using a small receptive field ($21\times 21$ in our experiments). Thus, it does not need to learn the global appearance of images and can be trained using synthetic image patches that only contain basic boundary structures. The transformer stage only receives the FoJ representation and has no access to the input image during inference. Therefore, the computational complexity of the transformer in CT-Bound is significantly lower than that of the classic Vision Transformer [15].

Besides verifying the accuracy and robustness of the proposed algorithm using noisy images simulated from standard benchmark datasets under different noise levels, we also test CT-Bound on noisy images captured with real-world cameras to generate real-time videos of boundary maps. The contributions of this paper are summarized as follows.

  • A two-stage, hybrid neural network architecture;

  • A robust, fast, and non-iterative solver of FoJ that enables real-time boundary detection on very noisy images;

  • A thorough experimental study that demonstrates CT-Bound achieves the highest or among the highest accuracy in detecting image boundaries from very noisy images compared to the previous best algorithms.

The code, the training data, the testing data, additional results, and the video demonstration of the proposed method can be found at https://github.com/guo-research-group/CT-Bound.

II Related Work

According to the classification by Gong et al. [16], image boundary detection methods can be divided into four categories, i.e., boundaries from luminance changes, texture changes, perceptual grouping, and illusory contours. We focus our literature review on the first category, to which the proposed method belongs.

Local boundary detection

The first step of boundary detection typically uses specially designed filters to locate local responses of image boundaries. The filters either maximize the detectability and localization accuracy of the boundaries under noise, such as the Roberts cross operator [14], the Canny detectors [6], the Laplacian detectors [17], and the perfect matching filters [18], or are sensitive to the direction of the image boundaries, e.g., Sobel filters [11], Gaussian quadrature pairs [13], and steerable filters [19]. To robustly detect boundaries under non-ideal edges, researchers have also developed sophisticated filters that are non-linear [20] or operate at multiple scales [21]. However, these classic methods based on local patches are insufficient when the image noise is severe, because the receptive field of a single filter contains too little visual information to determine the edges confidently. In this case, image smoothing is also not a solution, as the smoothing operation makes faint or fine boundary structures indistinguishable [18].

Global boundary refinement

Boundaries in natural images are usually piece-wise smooth. Based on this observation, people have developed algorithms to refine the global boundary map from the locally detected boundaries by regularizing the curvature, for example, the squared curvature [22] or total curvature [23] along the boundary. Another observation for global boundary refinement is that the neighboring boundary maps must agree. There have been methods that use this intuition by enforcing neighboring consistency [9, 24, 12]. These global refinement methods are typically iterative and thus could have a high computational complexity.

Deep learning boundary detection

The emergence of deep learning has enabled researchers to develop deep neural networks that fuse the two steps into an end-to-end architecture learned from data. These methods directly output global boundary maps in nonparametric ways [3, 2, 7, 8, 5, 25, 26] or parametric ways [27, 28], and they typically outperform traditional, non-learning-based edge detection methods. A recent work, Boundary Attention, also combines the FoJ representation with deep neural networks [1]. It targets the detection of complicated, fine boundary structures and outputs more in-depth boundary information, such as edge-aware distance maps, from noisy images. Compared with Boundary Attention, CT-Bound focuses on boundary detection and achieves a higher boundary detection accuracy on benchmark datasets while running three times faster. Nonetheless, we suggest readers also read the paper [1] for a more comprehensive understanding of this field.

III Methods

Figure 1: Field-of-junction (FoJ) representation [9].

We first briefly describe the FoJ representation [9] that we adopt in this work. Given an image patch $P\in\mathbb{R}^{h\times w\times k}$ with dimension $h\times w$ and $k$ color channels, the FoJ models its boundary structure using a parameter set $\Phi=(\boldsymbol{x},\boldsymbol{\phi},\boldsymbol{c})$, where $\boldsymbol{x}=(x_0,y_0)$ indicates the location of the vertex, $\boldsymbol{\phi}=(\phi_1,\cdots,\phi_l)$ represents the angles of the $l$ edges, and $\boldsymbol{c}=(\boldsymbol{c}_1,\cdots,\boldsymbol{c}_l)$, $\boldsymbol{c}_j\in\mathbb{R}^k$, $j=1,\cdots,l$, are the colors of the regions between every pair of neighboring edges. The number of edges $l$ is a hyperparameter that needs to be predetermined. See Fig. 1 for an illustration of FoJ. As shown by Verbin and Zickler, FoJ can represent a variety of local boundary structures, including edges, corners, and junctions [9].
Given an FoJ representation $\Phi$, the corresponding boundary map of the patch $B(x,y;\Phi)$ can be plotted via:

$$B(x,y;\Phi)=\pi\epsilon H_{2,\epsilon}^{\prime}\left(\min_{j}\, d_j(x,y)\right), \tag{1}$$

where $d_j(x,y)$ is the distance from the pixel $(x,y)$ to the edge $j$ and $H_{2,\epsilon}^{\prime}$ is the derivative of the Heaviside function $H_{2,\epsilon}$ in [29]:

$$H_{2,\epsilon}(d)=\frac{1}{2}\left(1+\frac{2}{\pi}\arctan\frac{d}{\epsilon}\right), \tag{2}$$

where $\epsilon$ is a smoothing parameter, which we set to $\epsilon=0.01$ throughout our experiments. The color map $C(x,y;\Phi)$ can be visualized using:

$$C(x,y;\Phi)=\sum_{j=1}^{l}\delta_j(x,y)\,\boldsymbol{c}_j, \tag{3}$$

where $\delta_j(x,y)=1$ when the pixel $(x,y)$ is within the wedge $\Omega_j$ between edges $j$ and $j+1$ in the patch $P$:

$$\begin{aligned}\Omega_j=\{&(x,y)\,|\,(x,y)\in P,\\ &(x-x_0)\cos\phi_j+(y-y_0)\sin\phi_j<0,\\ &(x-x_0)\cos\phi_{j+1}+(y-y_0)\sin\phi_{j+1}>0\},\end{aligned} \tag{4}$$

and $\delta_j(x,y)=0$ otherwise.
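As a rough illustration of (1) to (4), the sketch below renders a boundary map and a color map from a single FoJ parameter set. Since $H_{2,\epsilon}^{\prime}(d)=\frac{1}{\pi}\frac{\epsilon}{\epsilon^2+d^2}$, the boundary map in (1) reduces to $\epsilon^2/(\epsilon^2+d_{\min}^2)$, which equals 1 on an edge and decays quickly away from it. Several details are assumptions made for illustration and are not specified in the text: the function name `render_foj`, patch coordinates normalized to $[-1,1]$, edges modeled as rays leaving the vertex, and wedge membership decided by the polar angle of each pixel around the vertex instead of a literal transcription of (4).

```python
import numpy as np

def render_foj(x0, y0, phis, colors, size=21, eps=0.01):
    """Render boundary map, color map, and wedge labels for one patch.

    Assumptions (not from the paper): coordinates span [-1, 1], each edge is
    a ray from the vertex (x0, y0), and wedge j lies between sorted angles
    phi_j and phi_{j+1} (cyclically).
    """
    phis = np.mod(np.asarray(phis, dtype=float), 2 * np.pi)
    colors = np.asarray(colors, dtype=float)            # shape (l, k)
    order = np.argsort(phis)
    phis, colors = phis[order], colors[order]

    coords = np.linspace(-1.0, 1.0, size)
    xx, yy = np.meshgrid(coords, coords)                # xx varies along columns
    ux, uy = xx - x0, yy - y0                           # offsets from the vertex
    radius = np.hypot(ux, uy)

    # Distance from every pixel to every ray.
    cos_p, sin_p = np.cos(phis), np.sin(phis)
    along = ux[..., None] * cos_p + uy[..., None] * sin_p
    perp = np.abs(ux[..., None] * sin_p - uy[..., None] * cos_p)
    dist = np.where(along >= 0, perp, radius[..., None])
    d_min = dist.min(axis=-1)

    # Eq. (1) with the closed-form derivative of Eq. (2).
    boundary = eps**2 / (eps**2 + d_min**2)

    # Wedge label by polar angle; pixels before phi_1 fall in the last wedge.
    theta = np.mod(np.arctan2(uy, ux), 2 * np.pi)
    wedge = (np.searchsorted(phis, theta, side="right") - 1) % len(phis)

    # Eq. (3): each pixel takes the color of its wedge.
    color_map = colors[wedge]
    return boundary, color_map, wedge
```

Calling `render_foj(0.0, 0.0, [0.0, 2 * np.pi / 3, 4 * np.pi / 3], np.eye(3))` renders a symmetric three-way junction with one pure color per wedge.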

III-A Network architecture


Figure 2: Network architecture of CT-Bound. The architecture consists of two stages. The initialization stage contains shared-weights convolutional neural networks that output the FoJ representation of every image patch. The refinement stage contains a transformer encoder that simultaneously refines all per-patch FoJ representations. Finally, the framework combines all per-patch FoJ representations together to output the global boundary map and the color map.

The network architecture of the proposed method is visualized in Fig. 2. Given a noisy image $I\in\mathbb{R}^{H\times W\times k}$, CT-Bound first divides the image $I$ into overlapping patches. The initialization stage feeds each image patch $P_{m,n}$ into a CNN to generate the initial vertex location and edge angles of the FoJ representation $(\boldsymbol{x}^{\text{init}}_{m,n},\boldsymbol{\phi}^{\text{init}}_{m,n})$. Then, the method determines the color parameters $\boldsymbol{c}^{\text{init}}_{m,n}=(\boldsymbol{c}^{\text{init}}_{m,n,1},\cdots,\boldsymbol{c}^{\text{init}}_{m,n,l})$ analytically by averaging the colors of the pixels in each wedge of the patch:

$$\boldsymbol{c}^{\text{init}}_{m,n,j}=\frac{1}{|\Omega_{m,n,j}|}\sum_{(x,y)\in\Omega_{m,n,j}}P_{m,n}(x,y,:), \tag{5}$$

where $\Omega_{m,n,j}$ indicates the set of pixels in the wedge between edges $j$ and $j+1$ in the patch $P_{m,n}$, as defined in (4).
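The per-wedge averaging in (5) can be sketched as follows. The helper name `wedge_mean_colors` and the integer wedge-label map are hypothetical conveniences: the label map stands in for the sets $\Omega_{m,n,j}$, and wedges that contain no pixel are assigned zeros here, a choice the text does not specify.

```python
import numpy as np

def wedge_mean_colors(patch, wedge, num_wedges):
    """Average the patch colors inside each wedge, as in Eq. (5).

    patch:  (h, w, k) array of pixel colors.
    wedge:  (h, w) integer array; wedge[y, x] = j means the pixel is in wedge j.
    Returns an array of shape (num_wedges, k).
    """
    colors = np.zeros((num_wedges, patch.shape[-1]))
    for j in range(num_wedges):
        mask = wedge == j
        if mask.any():                      # empty wedges stay at zero (assumption)
            colors[j] = patch[mask].mean(axis=0)
    return colors
```

Because the colors follow deterministically from the vertex and the angles, only the five geometric parameters need to be predicted by a network.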

The refinement stage takes in the initial FoJ representations $(\boldsymbol{x}^{\text{init}}_{m,n},\boldsymbol{\phi}^{\text{init}}_{m,n},\boldsymbol{c}^{\text{init}}_{m,n})$ of all patches $P_{m,n}$ simultaneously. It first converts each initial FoJ representation into a feature vector $\boldsymbol{v}_{m,n}\in\mathbb{R}^{d}$, and applies positional encoding by adding a positional vector $\boldsymbol{p}_{m,n}=[p_{m,n,0},\cdots,p_{m,n,d-1}]^{T}$ to the feature vector $\boldsymbol{v}_{m,n}$ to incorporate the positional information of each image patch. The 2D positional encoding vector follows the design of Zhang and Liu [30]:

$$p_{m,n,i}=\begin{cases}\sin\left(\frac{m}{10000^{4i/d}}\right), & i=0,2,4,\cdots,d/2-2\\ \cos\left(\frac{m}{10000^{4i/d}}\right), & i=1,3,5,\cdots,d/2-1\\ \sin\left(\frac{n}{10000^{4i/d}}\right), & i=d/2,d/2+2,\cdots,d-2\\ \cos\left(\frac{n}{10000^{4i/d}}\right), & i=d/2+1,d/2+3,\cdots,d-1\end{cases}$$

where the dimension of each feature vector $d$ is an even number. All positionally encoded feature vectors are fed into a transformer encoder consisting of a series of multi-head attention layers to refine the boundary consistency among patches globally and to adjust unnatural boundary estimations. It is the only block in the framework that shares the per-patch FoJ information globally. Finally, the framework outputs the refined vertex locations and edge angles of all patches $(\boldsymbol{x}^{\text{ref}}_{m,n},\boldsymbol{\phi}^{\text{ref}}_{m,n}),\ \forall m,n$, and calculates the refined color parameters $\boldsymbol{c}^{\text{ref}}_{m,n}$ using (5). We list the network hyperparameters used in our experiments in Tab. I. As the transformer does not operate on the image domain, the dimension of the input vector and the number of layers are much smaller than those of the classic Vision Transformer [15].
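A minimal sketch of a 2D sinusoidal positional encoding with the channel layout described above: the first half of the vector encodes the patch row index $m$, the second half encodes the column index $n$, and sine and cosine alternate. The frequency indexing is an assumption: this sketch follows the common convention in which each sine/cosine pair $q\in[0,d/4)$ shares the frequency $10000^{-4q/d}$ and the index restarts for the second half, which may differ from the exact indexing intended in the formula. It therefore requires $d$ to be divisible by 4. The function name is hypothetical.

```python
import numpy as np

def positional_encoding_2d(m, n, d):
    """Sinusoidal 2D positional vector for the patch at grid position (m, n).

    Assumed convention: pair index q in [0, d/4) sets the frequency
    10000^(-4q/d); channels [0, d/2) encode m and channels [d/2, d) encode n.
    """
    if d % 4 != 0:
        raise ValueError("this sketch assumes d is divisible by 4")
    q = np.arange(d // 4)
    freq = 1.0 / (10000.0 ** (4.0 * q / d))
    half = d // 2
    p = np.zeros(d)
    p[0:half:2] = np.sin(m * freq)       # even channels, first half
    p[1:half:2] = np.cos(m * freq)       # odd channels, first half
    p[half::2] = np.sin(n * freq)        # even channels, second half
    p[half + 1::2] = np.cos(n * freq)    # odd channels, second half
    return p
```

With the input dimension of 128 listed in Tab. I, `positional_encoding_2d(m, n, 128)` would be added element-wise to the feature vector $\boldsymbol{v}_{m,n}$ before the encoder.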

TABLE I: CT-Bound model hyperparameters.

Convolutional Neural Network

| Layer | Specification | Output |
|---|---|---|
| Conv2d | 5×5 kernel, 4 stride, 2 pad | (21, 21, 96) |
| MaxPool2d | 3×3 kernel, 2 stride, 0 pad | (10, 10, 96) |
| Conv2d | 5×5 kernel, 1 stride, 2 pad | (10, 10, 256) |
| MaxPool2d | 2×2 kernel, 2 stride, 0 pad | (5, 5, 256) |
| Conv2d | 3×3 kernel, 1 stride, 1 pad | (5, 5, 384) |
| Conv2d | 3×3 kernel, 1 stride, 1 pad | (5, 5, 384) |
| Conv2d | 3×3 kernel, 1 stride, 1 pad | (5, 5, 256) |
| MaxPool2d | 3×3 kernel, 2 stride, 0 pad | (2, 2, 256) |
| FC | - | 4096 |
| FC | - | 1024 |
| FC | - | 5 |

Transformer Encoder

| Specification | Parameter |
|---|---|
| Dimension of each input vector | 128 |
| Number of layers | 8 |
| Number of heads in each layer | 8 |
| Dimension of the feed-forward layer | 256 |
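The output sizes in Tab. I can be checked with the standard formula for convolution and pooling layers, $\lfloor (s_{\text{in}}+2\,\text{pad}-\text{kernel})/\text{stride}\rfloor+1$. The sketch below traces the spatial size through the listed layers. The side length of the tensor entering the first layer is not stated in this section, so the value 84 used here is hypothetical; any side length from 81 to 84 reproduces the listed $21\times 21$ first output.

```python
def conv_out(size, kernel, stride, pad):
    """Output side length of a convolution or pooling layer (floor convention)."""
    return (size + 2 * pad - kernel) // stride + 1

# (layer, kernel, stride, pad, output channels), copied from Tab. I.
LAYERS = [
    ("Conv2d", 5, 4, 2, 96),
    ("MaxPool2d", 3, 2, 0, 96),
    ("Conv2d", 5, 1, 2, 256),
    ("MaxPool2d", 2, 2, 0, 256),
    ("Conv2d", 3, 1, 1, 384),
    ("Conv2d", 3, 1, 1, 384),
    ("Conv2d", 3, 1, 1, 256),
    ("MaxPool2d", 3, 2, 0, 256),
]

def trace_shapes(input_size):
    """Return the (height, width, channels) output of every layer in order."""
    shapes = []
    size = input_size
    for _name, kernel, stride, pad, channels in LAYERS:
        size = conv_out(size, kernel, stride, pad)
        shapes.append((size, size, channels))
    return shapes

shapes = trace_shapes(84)        # 84 is a hypothetical input side length
h, w, c = shapes[-1]
flattened = h * w * c            # number of features entering the first FC layer
```

Under this assumption the last pooling layer yields a $2\times 2\times 256$ tensor, that is, 1024 features entering the fully connected layers, whose final output of size 5 provides the vertex and angle parameters.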

From the refined FoJ parameters $\Phi^{\text{ref}}_{m,n}=(\boldsymbol{x}^{\text{ref}}_{m,n},\boldsymbol{\phi}^{\text{ref}}_{m,n},\boldsymbol{c}^{\text{ref}}_{m,n})$, CT-Bound generates the per-patch boundary and color maps according to (1) and (3), and computes the global boundary map $B(x,y)$ by averaging the per-patch boundary maps [9]:

$$B(x,y)=\frac{1}{\left|N(x,y)\right|}\sum_{N(x,y)}B(x,y;\Phi^{\text{ref}}_{m,n}), \tag{6}$$

and the global color map $C(x,y)$ via a specific smoothing operation over the per-patch color maps:

$$C(x,y)=\frac{1}{\left|N(x,y)\right|}\sum_{N(x,y)}\delta_{m,n,j}(x,y)\,\boldsymbol{c}^{\text{ref}}_{m,n,j}, \tag{7}$$

where $N(x,y)=\left\{(m,n)\,|\,(x,y)\in\Omega_{m,n,j}\right\}$ is the set of indices of the patches that contain $(x,y)$, $\delta_{m,n,j}$ is a binary indicator that is $1$ if pixel $(x,y)$ belongs to the wedge $\Omega_{m,n,j}$ and $0$ otherwise, and $\boldsymbol{c}^{\text{ref}}_{m,n,j}$ is the refined color of wedge $\Omega_{m,n,j}$.
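The averaging in (6) and (7) amounts to accumulating every per-patch map at its location in the image and dividing by the number of patches covering each pixel. A minimal sketch follows, assuming patches laid out on a regular grid with a fixed stride; the stride, the function name, and the array layout are illustrative choices that this section does not specify.

```python
import numpy as np

def aggregate_patches(patch_maps, height, width, stride):
    """Average overlapping per-patch maps into one global map.

    patch_maps: (M, N, h, w) boundary maps or (M, N, h, w, k) color maps,
                where patch (m, n) starts at pixel (m * stride, n * stride).
    Returns an array of shape (height, width) or (height, width, k).
    """
    num_m, num_n, h, w = patch_maps.shape[:4]
    acc = np.zeros((height, width) + patch_maps.shape[4:])
    counts = np.zeros((height, width))
    for m in range(num_m):
        for n in range(num_n):
            r, c = m * stride, n * stride
            acc[r:r + h, c:c + w] += patch_maps[m, n]
            counts[r:r + h, c:c + w] += 1
    counts = np.maximum(counts, 1)       # avoid dividing by zero where nothing overlaps
    return acc / counts.reshape(counts.shape + (1,) * (acc.ndim - 2))
```

The same routine serves both maps: passing per-patch boundary maps gives $B(x,y)$, and passing per-patch color maps gives $C(x,y)$.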

III-B Loss functions

We develop a multi-stage training scheme to optimize the parameters of CT-Bound. First, we train the initialization stage using the patch reconstruction loss:

$$\mathcal{L}_{\text{init}}=\mathbb{E}_{P}\left(\mathrm{MSE}\left(C\left(x,y;\Phi^{\text{gt}}\right),C\left(x,y;\Phi^{\text{init}}\right)\right)\right), \tag{8}$$

where $\mathbb{E}_{P}$ denotes the expectation over all patches in the training set, and $C(x,y;\Phi^{\text{gt}})$ and $C(x,y;\Phi^{\text{init}})$ indicate the per-patch color maps reconstructed using the true and estimated FoJ parameters, respectively. We observe that the visual quality of the FoJ estimation is higher when training with the loss in (8) than when directly supervising the FoJ parameters. Furthermore, because the CNN in the initialization stage has a small receptive field, we can use synthetic image patches of basic shapes to train it, and we observe that the trained model generalizes to real-world image patches without further fine-tuning.

When optimizing the parameters of the refinement stage, we use a fixed, pre-trained initialization stage to generate the inputs $\Phi^{\text{init}}$. We adopt a two-step training process for the refinement stage, which we observed leads to more stable and faster convergence. In the first step, a mean squared error loss function is adopted to supervise the estimated FoJ parameters directly:

$$\mathcal{L}_{\text{ref1}}=\mathbb{E}_{P}\left(\lVert\boldsymbol{x}^{\text{gt}}-\boldsymbol{x}^{\text{ref}}\rVert^{2}+\lVert\boldsymbol{\phi}^{\text{gt}}-\boldsymbol{\phi}^{\text{ref}}\rVert^{2}\right). \tag{9}$$

In the second step, we use a comprehensive image reconstruction loss adapted from Verbin and Zickler [9]:

$$\mathcal{L}_{\text{ref2}}=\mathbb{E}_{I}\left(l_{p}+\lambda_{b}l_{b}+\lambda_{c}l_{c}\right), \tag{10}$$

where $\mathbb{E}_{I}$ indicates the expectation over all images in the dataset, and $l_{p}$, $l_{b}$, and $l_{c}$ are the patch, boundary, and color loss terms, respectively:

$$\begin{aligned}l_{p}&=\sum_{m,n}\sum_{j=1}^{l}\sum_{x,y}\delta_{m,n,j}(x,y)\lVert\boldsymbol{c}^{\text{ref}}_{m,n,j}-I(x,y)\rVert^{2},\\ l_{b}&=\sum_{m,n}\sum_{x,y}\left(B(x,y)-B(x,y;\Phi^{\text{ref}}_{m,n})\right)^{2},\\ l_{c}&=\sum_{m,n}\sum_{j=1}^{l}\sum_{x,y}\delta_{m,n,j}(x,y)\lVert\boldsymbol{c}^{\text{ref}}_{m,n,j}-C(x,y)\rVert^{2}.\end{aligned}$$

Verbin and Zickler minimize the loss in (10) in an alternating, two-step fashion to refine the FoJ representation iteratively [9]. We instead evaluate the loss in a single step and show that it can successfully fine-tune the feed-forward transformer encoder to improve the FoJ representation.
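To make the structure of (10) concrete, the following is a minimal sketch for a single patch, written with the Python standard library only. It is not the authors' implementation. It assumes that `masks` holds the wedge indicators $\delta_{m,n,j}$, `colors` the wedge colors $\boldsymbol{c}^{\text{ref}}_{m,n,j}$, `global_color` and `global_boundary` the maps $C$ and $B$, and `patch_boundary` the rendering $B(\cdot;\Phi^{\text{ref}}_{m,n})$. All function names and the default weights `lambda_b` and `lambda_c` are hypothetical.

```python
def masked_color_loss(masks, colors, target):
    """Sum over wedges j and pixels of mask_j(x, y) * ||c_j - target(x, y)||^2."""
    total = 0.0
    for mask, color in zip(masks, colors):
        for mask_row, target_row in zip(mask, target):
            for weight, pixel in zip(mask_row, target_row):
                total += weight * sum((c - t) ** 2 for c, t in zip(color, pixel))
    return total


def boundary_loss(global_boundary, patch_boundary):
    """Sum of squared differences between the global and per-patch boundary maps."""
    return sum(
        (g - b) ** 2
        for g_row, b_row in zip(global_boundary, patch_boundary)
        for g, b in zip(g_row, b_row)
    )


def refinement_loss(masks, colors, image, global_color, global_boundary,
                    patch_boundary, lambda_b=1.0, lambda_c=1.0):
    """Combine the three terms as in (10); the weights here are placeholders."""
    l_p = masked_color_loss(masks, colors, image)         # patch term vs. input image I
    l_b = boundary_loss(global_boundary, patch_boundary)  # boundary consistency term
    l_c = masked_color_loss(masks, colors, global_color)  # color consistency term vs. C
    return l_p + lambda_b * l_b + lambda_c * l_c
```

In a full pipeline these terms would be summed over all patches $(m,n)$ and averaged over images, and would be computed with a differentiable array library rather than nested lists.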

IV Experimental Results

IV-A Data processing

To train the initialization stage, we use patches randomly sampled from the FoJ synthetic datasets [9], which contain only images of basic shapes such as squares. We select $8000$ image patches for training and $2000$ for testing. To simulate image noise, we apply a Poisson-Gaussian process to each image patch [31]:

$$P(x,y)=\mathrm{Poisson}\left(\alpha P^{*}(x,y)\right)+\mathrm{Gaussian}\left(0,\sigma^{2}\right), \tag{11}$$

where $P(x,y)$ and $P^{*}(x,y)\in[0,1]$ are the noisy and normalized clean image patches, respectively, $\alpha$ is the photon level parameter that controls the noise of the image (a smaller $\alpha$ gives a noisier image), and $\sigma$ is the standard deviation of the read noise ($\sigma=2$ is applied as in [32]). For the refinement stage, we use images from MS COCO [33] for training and testing. The training and testing sets contain $1600$ and $400$ randomly selected, non-overlapping images, respectively. Each image is center-cropped to $147\times 147$ and corrupted with the same noise model as in (11). In our experiment, we randomly set the photon level $\alpha$ within the range $[2,10]$ to generate images with a variety of noise levels.
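The noise model in (11) can be sketched as follows. This is an illustrative, standard-library-only version, not the authors' data pipeline. It treats an image as a 2D list of clean intensities in $[0,1]$ for a single channel, and the function names `sample_poisson` and `add_poisson_gaussian_noise` are hypothetical. The Poisson sampler uses Knuth's multiplication method, which is adequate for the small rates $\alpha P^{*}\leq 10$ considered here.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's method (suitable for small lam)."""
    if lam <= 0.0:
        return 0
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def add_poisson_gaussian_noise(clean, alpha, sigma=2.0, rng=None):
    """Apply P = Poisson(alpha * P*) + Gaussian(0, sigma^2) to every pixel."""
    rng = rng or random.Random()
    return [
        [sample_poisson(alpha * p, rng) + rng.gauss(0.0, sigma) for p in row]
        for row in clean
    ]


def random_photon_level(rng, low=2.0, high=10.0):
    """Pick a photon level uniformly in [low, high], mirroring the range [2, 10]."""
    return rng.uniform(low, high)
```

For a constant clean patch with $P^{*}=1$, the noisy pixels have mean $\alpha$ and variance $\alpha+\sigma^{2}$, so at $\alpha=2$ and $\sigma=2$ the noise standard deviation exceeds the signal level. In practice the same model would be vectorized with an array library.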

Figure 3: Effect of refinement and boundary selection. (a-b) An input noisy image and the corresponding clean image from the MS COCO dataset [33]. (c-d) The color map and boundary map before the refinement stage. (e-f) The color map and boundary map after the refinement stage. Noisy edge estimations are removed in refinement, and the color map is smoother.
TABLE II: ODS F1-score of CT-Bound before and after the refinement (red and blue numbers, respectively, in the original table).

| Photon level $\alpha_{\text{test}}$ | BSDS500 [12], before | BSDS500 [12], after | NYUDv2 [34], before | NYUDv2 [34], after |
|---|---|---|---|---|
| 2 | 0.482 | 0.541 | 0.479 | 0.552 |
| 4 | 0.509 | 0.627 | 0.522 | 0.633 |
| 6 | 0.518 | 0.640 | 0.538 | 0.646 |
| 8 | 0.524 | 0.633 | 0.546 | 0.647 |

We evaluate CT-Bound on the testing sets of the Berkeley Segmentation Data Set 500 (BSDS500) [12] and the NYU Depth Dataset V2 (NYUDv2) [34]. We crop images to $147\times 147$ and add noise as above. BSDS500 has 200 testing images. For NYUDv2, 200 images are randomly selected from the testing split adopted in [35]. Evaluating on datasets that differ from the training data demonstrates the generalizability of our model.

IV-B Implementation details

We use $l=3$ in our implementation. All optimizations in this work use the Adam optimizer [36]. The initialization stage is trained with an initial learning rate of $0.0002$ and a decay of $0.5$ every $80$ epochs. The batch size is $32$, and the total number of training epochs is $900$. We use a two-step scheme to train the refinement stage as described in Sec. III-B. Both steps use a batch size of $16$. The first step uses (9) as the objective function and runs for $100$ epochs with a learning rate of $5\times 10^{-5}$. The second step switches to (10) as its loss function and runs for $1600$ epochs. The learning rate for the second step is updated with a triangular cycle between $1.75\times 10^{-4}$ and $3.5\times 10^{-4}$. Training and testing are performed on a machine with an NVIDIA GeForce RTX A5000 graphics card and 24 GB memory.
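The two learning-rate schedules described above can be sketched as plain functions. This is an illustration under stated assumptions rather than the training code: the "decay of $0.5$ every $80$ epochs" is read as a multiplicative step decay, and the length of the triangular cycle is not given in the text, so `half_cycle` is a hypothetical parameter with an arbitrary default.

```python
def step_decay_lr(epoch, base_lr=2e-4, gamma=0.5, step_size=80):
    """Initialization stage: multiply the rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def triangular_lr(step, min_lr=1.75e-4, max_lr=3.5e-4, half_cycle=100):
    """Refinement step 2: ramp linearly from min_lr to max_lr and back again.

    half_cycle (steps per ramp) is an assumed value; the text does not state it.
    """
    position = step % (2 * half_cycle)
    if position <= half_cycle:
        fraction = position / half_cycle        # rising edge
    else:
        fraction = 2.0 - position / half_cycle  # falling edge
    return min_lr + (max_lr - min_lr) * fraction
```

Under this reading, the initialization stage reaches a rate of $0.0002\times 0.5^{11}\approx 9.8\times 10^{-8}$ by its final epochs, since $\lfloor 899/80\rfloor=11$.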

The optimal dataset scale (ODS) F1-score, which uses a single fixed contour threshold across the whole dataset, is recorded during evaluation, with non-maximum suppression [6] applied beforehand. We adjust the localization tolerance proportionally, based on [5], to accommodate the image size in our experiment, setting it to 0.0209 for BSDS500 and 0.0372 for NYUDv2.
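One reading of the proportional adjustment that reproduces both reported values is to keep the matching distance fixed in pixels while the image diagonal shrinks. The sketch below relies on assumptions that are not stated in the text: that the tolerance is expressed as a fraction of the image diagonal, that the commonly used base tolerances are 0.0075 for BSDS500 and 0.011 for NYUDv2, and that the original image sizes are $481\times 321$ and $560\times 425$, respectively. The function name `scaled_tolerance` is hypothetical.

```python
import math


def scaled_tolerance(base_tolerance, original_size, new_size):
    """Rescale a diagonal-normalized tolerance so its pixel distance is unchanged."""
    original_diag = math.hypot(*original_size)
    new_diag = math.hypot(*new_size)
    return base_tolerance * original_diag / new_diag


# Assumed base tolerances and original image sizes (not given in the text).
bsds = scaled_tolerance(0.0075, (481, 321), (147, 147))
nyud = scaled_tolerance(0.011, (560, 425), (147, 147))
print(round(bsds, 4), round(nyud, 4))  # 0.0209 0.0372
```

That both rounded values agree with the reported settings suggests, but does not confirm, that this is the adjustment used.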

IV-C Ablation study

TABLE III: Quantitative comparison of boundary detection on noisy images synthesized from the BSDS500 [12] and NYUDv2 [34] datasets. The numbers are ODS F1-scores. The proposed method and Boundary Attention [1] achieve the highest or second-highest boundary detection accuracy on both datasets when the image is very noisy, i.e., $\alpha_{\text{test}}=2$. Meanwhile, ours is about three times faster than Boundary Attention [1]. When the noise level becomes lower, i.e., $\alpha_{\text{test}}$ becomes larger, Restormer [4]→HED [5] starts to perform better. This result suggests the robustness of our method in detecting boundaries at extremely high noise levels.
The first four score columns are for BSDS500 [12] and the last four for NYUDv2 [34], each at photon level $\alpha_{\text{test}}\in\{2,4,6,8\}$. FPS backends listed in the original table: §OpenCV-CPU, PyTorch-GPU, JAX-GPU.

| Model | Publication'Year | BSDS500, 2 | BSDS500, 4 | BSDS500, 6 | BSDS500, 8 | NYUDv2, 2 | NYUDv2, 4 | NYUDv2, 6 | NYUDv2, 8 | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| Canny [6] | PAMI'86 | 0.493 | 0.489 | 0.383 | 0.495 | 0.476 | 0.467 | 0.325 | 0.482 | 625§ |
| HED [5] | ICCV'15 | 0.327 | 0.388 | 0.456 | 0.520 | 0.307 | 0.363 | 0.434 | 0.484 | 18.8§ |
| FoJ [9] | ICCV'21 | 0.509 | 0.564 | 0.597 | 0.611 | 0.529 | 0.576 | 0.597 | 0.613 | 1/68 |
| PiDiNet [8] | ICCV'21 | 0.316 | 0.356 | 0.447 | 0.480 | 0.274 | 0.311 | 0.409 | 0.456 | 125 |
| EDTER [7] | CVPR'22 | 0.484 | 0.509 | 0.545 | 0.594 | 0.488 | 0.478 | 0.534 | 0.581 | 15.6 |
| Restormer [4]→Canny [6] | CVPR'22+PAMI'86 | 0.490 | 0.521 | 0.518 | 0.516 | 0.511 | 0.533 | 0.503 | 0.499 | 9.6 |
| Restormer [4]→HED [5] | CVPR'22+ICCV'15 | 0.459 | 0.628 | 0.674 | 0.707 | 0.474 | 0.625 | 0.647 | 0.663 | 6.4 |
| UAED [3] | CVPR'23 | 0.473 | 0.553 | 0.618 | 0.665 | 0.452 | 0.557 | 0.620 | 0.652 | 28.7 |
| PEdger [2] | ACMMM'23 | 0.525 | 0.611 | 0.657 | 0.684 | 0.509 | 0.606 | 0.632 | 0.650 | 14.4 |
| Boundary Attention [1] | arXiv'24 | 0.534 | 0.591 | 0.607 | 0.615 | 0.572 | 0.609 | 0.624 | 0.625 | 4.3 |
| Ours | - | 0.541 | 0.627 | 0.640 | 0.633 | 0.552 | 0.633 | 0.646 | 0.647 | 11.4, 15.2 |

Figure 4: Qualitative comparison on noisy images synthesized from the BSDS500 [12] and NYUDv2 [34] datasets with photon level $\alpha_{\text{test}}=2$. The proposed method shows robustness to the high noise level, while other methods fail to produce accurate boundaries. Additionally, ours can detect faint boundaries that are nearly invisible to the eye.

We analyze the benefit offered by the refinement stage of CT-Bound. As shown in Fig. 3, the refinement stage attenuates noisy and inconsistent boundary estimations and strengthens real boundaries compared to the boundary map from the initialization stage. It also makes the color map appear smoother and sharper at color boundaries. The quantitative analysis is shown in Tab. II. It draws the same conclusion: the refinement stage increases the ODS F1-score of the boundary map compared to the initialization stage.

IV-D Analysis on synthetic and real images

We compare the proposed method with the iterative FoJ solver [9], the traditional edge detector Canny [6], and other learning-based models, including Boundary Attention [1], PEdger [2], UAED [3], Restormer [4]→HED [5], Restormer [4]→Canny [6], EDTER [7], PiDiNet [8], and HED [5]. Tab. III shows the quantitative comparison of these methods. Note that our model is trained using images from the MS COCO dataset [33] with a random photon level $\alpha_{\text{train}}\in[2,10]$ and evaluated using images from the BSDS500 [12] and NYUDv2 [34] datasets at a specific photon level $\alpha_{\text{test}}$. The proposed approach achieves the highest or near-highest ODS F1-score when the noise level is very high, i.e., $\alpha_{\text{test}}=2$. Fig. 4 shows sample boundary maps estimated from noisy images synthesized from the BSDS500 [12] and NYUDv2 [34] datasets. Both results indicate the robustness and generalizability of the proposed method to different image datasets and noise levels.

Figure 5: Comparison using real captured images. The image is taken at the exposure of 1/10900s and ISO 40.

Fig. 5 shows the boundary maps and color maps estimated from a real image captured by an iPhone 13 Mini camera with a high shutter speed. The boundary map and color map generated from the proposed method have high visual quality without any fine-tuning to the real images. We also upload a video of boundary maps and color maps of a real captured video clip generated by CT-Bound to the URL listed in Sec. I.

V Conclusion and Limitation

In this paper, we propose CT-Bound, a two-stage boundary detector with a hybrid neural network architecture that aims to achieve robust and accurate boundary detection on extremely noisy images in a single shot. Compared to a variety of models, our method demonstrates the highest or near-highest boundary detection accuracy on benchmark datasets at the highest noise level, producing visually clean and crisp boundaries. A limitation we observe is that, when processing videos, there are abrupt changes in detected boundaries across frames. This is because CT-Bound processes each frame independently. The problem could be mitigated by introducing temporal consistency into the model.

References

  • [1] Mia Gaia Polansky, Charles Herrmann, Junhwa Hur, Deqing Sun, Dor Verbin, and Todd Zickler, “Boundary attention: Learning to find faint boundaries at any resolution,” arXiv preprint arXiv:2401.00935, 2023.
  • [2] Yuanbin Fu and Xiaojie Guo, “Practical edge detection via robust collaborative learning,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2526–2534.
  • [3] Caixia Zhou, Yaping Huang, Mengyang Pu, Qingji Guan, Li Huang, and Haibin Ling, “The treasure beneath multiple annotations: An uncertainty-aware edge detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15507–15517.
  • [4] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5728–5739.
  • [5] Saining Xie and Zhuowen Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1395–1403.
  • [6] John Canny, “A computational approach to edge detection,” IEEE Transactions on pattern analysis and machine intelligence, vol. PAMI-8, no. 6, pp. 679–698, 1986.
  • [7] Mengyang Pu, Yaping Huang, Yuming Liu, Qingji Guan, and Haibin Ling, “Edter: Edge detection with transformer,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1402–1412.
  • [8] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu, “Pixel difference networks for efficient edge detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 5117–5127.
  • [9] Dor Verbin and Todd Zickler, “Field of junctions: Extracting boundary structure at low snr,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6869–6878.
  • [10] Rui Sun, Tao Lei, Qi Chen, Zexuan Wang, Xiaogang Du, Weiqiang Zhao, and Asoke K Nandi, “Survey of image edge detection,” Frontiers in Signal Processing, vol. 2, pp. 826967, 2022.
  • [11] FG Irwin et al., “An isotropic 3x3 image gradient operator,” Presentation at Stanford AI Project, vol. 1968, pp. 3, 2014.
  • [12] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik, “Contour detection and hierarchical image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 5, pp. 898–916, 2010.
  • [13] Jitendra Malik, Serge Belongie, Thomas Leung, and Jianbo Shi, “Contour and texture analysis for image segmentation,” International journal of computer vision, vol. 43, pp. 7–27, 2001.
  • [14] L Roberts, “Machine perception of 3-d solids, optical and electro-optical information processing,” 1965.
  • [15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [16] Xin-Yi Gong, Hu Su, De Xu, Zheng-Tao Zhang, Fei Shen, and Hua-Bin Yang, “An overview of contour detection approaches,” International Journal of Automation and Computing, vol. 15, pp. 656–672, 2018.
  • [17] Xin Wang, “Laplacian operator-based edge detectors,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 5, pp. 886–890, 2007.
  • [18] Nati Ofir, Meirav Galun, Sharon Alpert, Achi Brandt, Boaz Nadler, and Ronen Basri, “On detection of faint edges in noisy images,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 4, pp. 894–908, 2019.
  • [19] William T Freeman, Edward H Adelson, et al., “The design and use of steerable filters,” IEEE Transactions on Pattern analysis and machine intelligence, vol. 13, no. 9, pp. 891–906, 1991.
  • [20] Pietro Perona, Jitendra Malik, et al., “Detecting and localizing edges composed of steps, peaks and roofs,” 1991.
  • [21] Y. Lu and R.C. Jain, “Reasoning about edges in scale space,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 4, pp. 450–468, 1992.
  • [22] Claudia Nieuwenhuis, Eno Toeppe, Lena Gorelick, Olga Veksler, and Yuri Boykov, “Efficient squared curvature,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4098–4105.
  • [23] Qiuxiang Zhong, Yutong Li, Yijie Yang, and Yuping Duan, “Minimizing discrete total curvature for image processing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9474–9482.
  • [24] Nati Ofir, Meirav Galun, Boaz Nadler, and Ronen Basri, “Fast detection of curved edges at low snr,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 213–221.
  • [25] Xavier Soria Poma, Edgar Riba, and Angel Sappa, “Dense extreme inception network: Towards a robust cnn model for edge detection,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 1923–1932.
  • [26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  • [27] Kun Huang, Yifan Wang, Zihan Zhou, Tianjiao Ding, Shenghua Gao, and Yi Ma, “Learning to parse wireframes in images of man-made environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 626–635.
  • [28] Nan Xue, Song Bai, Fudong Wang, Gui-Song Xia, Tianfu Wu, and Liangpei Zhang, “Learning attraction field representation for robust line segment detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1595–1603.
  • [29] Tony F Chan and Luminita A Vese, “Active contours without edges,” IEEE Transactions on image processing, vol. 10, no. 2, pp. 266–277, 2001.
  • [30] Zelun Wang and Jyh-Charn Liu, “Translating math formula images to latex sequences using deep neural networks with sequence-level training,” International Journal on Document Analysis and Recognition (IJDAR), vol. 24, no. 1-2, pp. 63–75, 2021.
  • [31] Qiaoqiao Ding, Yong Long, Xiaoqun Zhang, and Jeffrey A Fessler, “Modeling mixed poisson-gaussian noise in statistical image reconstruction for x-ray ct,” Arbor, vol. 1001, pp. 48109, 2016.
  • [32] Stanley H Chan, “What does a one-bit quanta image sensor offer?,” IEEE Transactions on Computational Imaging, vol. 8, pp. 770–783, 2022.
  • [33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
  • [34] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus, “Indoor segmentation and support inference from rgbd images,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–760.
  • [35] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik, “Perceptual organization and recognition of indoor scenes from rgb-d images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 564–571.
  • [36] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

Supplemental Materials for CT-Bound: Robust Boundary Detection From Noisy Images Via Hybrid Convolution and Transformer Neural Networks

In this supplementary document, we first show samples of our training data in Sec. S1. Then, we report a quantitative comparison of color maps on the BSDS500 [12] and NYUDv2 [34] datasets in Sec. S2. Finally, we present more qualitative results from both datasets in Sec. S3.

S1 Training Data

We use different datasets for the initialization and refinement stages. For the former stage, our model needs patches as inputs only. We generate noisy patches based on those randomly sampled from FoJ synthetic datasets [9], which contain basic shapes in grayscale. We assign RGB colors randomly and add noise according to (11). For the latter stage, we use noisy images generated from MS COCO [33] with the same noise model. The ground truth boundary maps are generated by running FoJ [9] with the ground truth color maps. Some training pair samples are shown in Fig. S1.

Figure S1: Training pair (noisy image, ground truth boundary map, ground truth color map) samples for refinement stage training. The size of each image is $147\times 147$. The photon level for each image is randomly selected in the range of $[2,10]$.

S2 Quantitative Results of Color Maps

Since different photon levels cause different distribution parameters of pixel values, we normalize the pixel values before calculating the metrics for color maps. Specifically, a color map is normalized through:

$$P^{\prime}(x,y)=\frac{P(x,y)}{\alpha}, \tag{S1}$$

where $P(x,y)$ and $\alpha$ are from (11). Then the structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and mean squared error (MSE) are calculated between $P^{\prime}(x,y)$ and $P^{*}(x,y)$. The quantitative comparison is shown in Tab. S1.
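The normalization in (S1) and two of the three metrics can be sketched as follows. This is a standard-library illustration, not the evaluation code: it assumes a peak value of $1.0$ for PSNR (consistent with $P^{*}\in[0,1]$), treats a color map as a 2D list of scalars, and omits SSIM, which needs a windowed implementation. The function names are hypothetical.

```python
import math


def normalize_by_photon_level(color_map, alpha):
    """Apply (S1): divide every pixel by the photon level alpha."""
    return [[value / alpha for value in row] for row in color_map]


def mse(estimate, reference):
    """Mean squared error over all pixels of two equally sized 2D lists."""
    squared = [
        (e - r) ** 2
        for e_row, r_row in zip(estimate, reference)
        for e, r in zip(e_row, r_row)
    ]
    return sum(squared) / len(squared)


def psnr(estimate, reference, peak=1.0):
    """Peak signal-to-noise ratio in dB; the peak of 1.0 is an assumption."""
    error = mse(estimate, reference)
    return math.inf if error == 0 else 10.0 * math.log10(peak ** 2 / error)
```

As a sanity check on the scale of Tab. S1, an MSE of $10^{-2}$ corresponds to a PSNR of $20$ dB under this peak value.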

TABLE S1: Quantitative comparison of color maps on noisy images synthesized from the BSDS500 [12] and NYUDv2 [34] datasets. Higher is better for SSIM and PSNR (↑); lower is better for MSE (↓).

| Photon level $\alpha_{\text{test}}$ | Model | BSDS500 SSIM↑ | BSDS500 PSNR (dB)↑ | BSDS500 MSE ($\times 10^{-2}$)↓ | NYUDv2 SSIM↑ | NYUDv2 PSNR (dB)↑ | NYUDv2 MSE ($\times 10^{-2}$)↓ |
|---|---|---|---|---|---|---|---|
| 2 | FoJ [9] | 0.338 | 10.712 | 8.626 | 0.433 | 10.657 | 8.954 |
| 2 | Restormer [4] | 0.152 | 8.492 | 14.624 | 0.157 | 8.567 | 14.312 |
| 2 | Boundary Attention [1] | 0.301 | 10.952 | 8.982 | 0.355 | 10.791 | 9.284 |
| 2 | Ours | 0.332 | 10.724 | 8.606 | 0.467 | 10.709 | 8.870 |
| 4 | FoJ [9] | 0.441 | 17.212 | 2.012 | 0.572 | 17.619 | 1.913 |
| 4 | Restormer [4] | 0.268 | 14.918 | 3.431 | 0.222 | 14.920 | 3.478 |
| 4 | Boundary Attention [1] | 0.452 | 17.785 | 2.031 | 0.560 | 18.103 | 1.926 |
| 4 | Ours | 0.429 | 17.198 | 2.024 | 0.587 | 17.740 | 1.885 |
| 6 | FoJ [9] | 0.484 | 19.780 | 1.176 | 0.632 | 20.799 | 0.951 |
| 6 | Restormer [4] | 0.357 | 17.496 | 2.042 | 0.289 | 17.800 | 1.903 |
| 6 | Boundary Attention [1] | 0.510 | 20.542 | 1.130 | 0.641 | 21.633 | 0.908 |
| 6 | Ours | 0.468 | 19.688 | 1.209 | 0.637 | 20.918 | 0.947 |
| 8 | FoJ [9] | 0.507 | 20.900 | 0.951 | 0.667 | 22.427 | 0.669 |
| 8 | Restormer [4] | 0.430 | 19.258 | 1.406 | 0.353 | 19.706 | 1.289 |
| 8 | Boundary Attention [1] | 0.541 | 21.768 | 0.872 | 0.684 | 23.500 | 0.604 |
| 8 | Ours | 0.488 | 20.726 | 0.996 | 0.666 | 22.473 | 0.679 |

S3 Additional Qualitative Results

In this section, we show more qualitative comparison results at various photon levels $\alpha_{\text{test}}$, including color maps, in Fig. S2 and Fig. S3. The proposed method produces crisper boundaries than some of the models that achieve higher ODS F1-scores in Tab. III.

Figure S2: Additional qualitative comparison of noisy images synthesized from BSDS500 [12] dataset.
Figure S3: Additional qualitative comparison of noisy images synthesized from NYUDv2 [34] dataset.