CT-Bound: Robust Boundary Detection From Noisy Images Via Hybrid Convolution and Transformer Neural Networks
Abstract
We present CT-Bound, a robust and fast boundary detection method for very noisy images using a hybrid Convolution and Transformer neural network. The proposed architecture decomposes boundary estimation into two tasks: local detection and global regularization. During local detection, the model uses a convolutional architecture to predict the boundary structure of each image patch in the form of a pre-defined local boundary representation, the field-of-junctions (FoJ) [9]. Then, it uses a feed-forward transformer architecture to globally refine the boundary structures of all patches and generate an edge map and a smoothed color map simultaneously. Our quantitative analysis shows that CT-Bound outperforms the previous best algorithms in edge detection on very noisy images. It also increases the edge detection accuracy of FoJ-based methods while running about three times faster. Finally, we demonstrate that CT-Bound can produce boundary and color maps on real captured images without extra fine-tuning, and can generate real-time boundary map and color map videos at ten frames per second.
Index Terms:
boundary estimation, image denoising, convolutional neural network, transformer
Figure: Boundary detection from noisy images. Compared to a variety of models [1, 2, 3, 4, 5, 6, 7, 8, 9], ours robustly detects the boundaries even when they are visually challenging to discriminate.
I Introduction
Detecting boundary structures from very noisy images is a common and challenging computer vision problem [10]. Many applications require boundary detection from images captured at very low light levels, such as medical imaging, manufacturing, and autonomous navigation. Although image boundary detection has been broadly studied since the early stage of computer vision [11, 12, 13, 6, 14], our results show that the accuracy of the current best boundary detection algorithms is still unsatisfactory when the input images have a very low light level (see the teaser figure).
We present CT-Bound, a deep neural network architecture that can robustly detect boundaries from a single noisy image. The model processes the input image to predict, for each image patch, a generalized local boundary representation called the field-of-junctions (FoJ) [9]. FoJ can represent a variety of boundary types in image patches, including edges, corners, and contours, and is an effective prior in edge detection, especially for noisy images [9, 1]. By constraining the predicted boundary structures to those that FoJ can describe, we observe that our model can detect very faint edge signals in the presence of significant noise (see the teaser figure). Our experiments show that CT-Bound achieves the highest or near-highest boundary detection accuracy among a variety of recent edge detection methods on very noisy images.
CT-Bound consists of an innovative two-stage, hybrid Convolution and Transformer neural network architecture (Fig. 2). The first stage is a convolutional architecture that makes an initial prediction of a local FoJ parameterization solely based on the visual appearance of each image patch. The second stage consists of a feed-forward transformer encoder that takes in the initial FoJ estimations of all image patches to perform refinement. The architecture is novel as it completely decomposes boundary estimation into two tasks: detecting boundaries from local image patches, and regularizing neighboring boundary estimations so that they are consistent and resemble natural boundaries. The convolutional network stage conducts boundary detection using a small receptive field. Thus, it does not need to learn the global appearance of images and can be trained using synthetic image patches that only contain basic boundary structures. The transformer stage only receives the FoJ representation and has no access to the input image during inference. Therefore, the computational complexity of the transformer in CT-Bound is significantly lower than that of the classic Vision Transformer [15].
Besides verifying the accuracy and robustness of the proposed algorithm using noisy images simulated from standard, benchmark datasets under different noise levels, we also test CT-Bound using noisy images captured using real-world cameras to generate real-time videos of boundary maps. The contribution of this paper can be summarized as follows.
- A two-stage, hybrid neural network architecture;
- A robust, fast, and non-iterative solver of FoJ that enables real-time boundary detection on very noisy images;
- A thorough experimental study that demonstrates CT-Bound achieves the highest or among the highest accuracy in detecting image boundaries from very noisy images compared to the previous best algorithms.
The code, the training data, the testing data, additional results, and the video demonstration of the proposed method can be found at https://github.com/guo-research-group/CT-Bound.
II Related Work
According to the classification by Gong et al. [16], image boundary detection methods can be divided into four categories, i.e., boundaries from luminance changes, texture changes, perceptual grouping, and illusory contours. We focus our literature review on the first category, to which the proposed method belongs.
Local boundary detection
The first step of boundary detection is typically to use specially designed filters to locate local responses of image boundaries. The filters either maximize the detectability and localization accuracy of the boundaries under noise, such as the Roberts cross operator [14], the Canny detector [6], the Laplacian detectors [17], and the perfect matching filters [18], or are sensitive to the direction of the image boundaries, e.g., Sobel filters [11], Gaussian quadrature pairs [13], and steerable filters [19]. To robustly detect non-ideal edges, researchers have also developed more sophisticated filters that are non-linear [20] or operate at multiple scales [21]. However, these classic methods based on local patches are insufficient when the image noise is severe, because the receptive field of a single filter contains too little visual information to confidently determine the edges. In this case, image smoothing is not a solution either, as the smoothing operation makes faint or fine boundary structures indistinguishable [18].
Global boundary refinement
Boundaries in natural images are usually piece-wise smooth. Based on this observation, people have developed algorithms to refine the global boundary map from the locally detected boundaries by regularizing the curvature, for example, the squared curvature [22] or total curvature [23] along the boundary. Another observation for global boundary refinement is that the neighboring boundary maps must agree. There have been methods that use this intuition by enforcing neighboring consistency [9, 24, 12]. These global refinement methods are typically iterative and thus could have a high computational complexity.
Deep learning boundary detection
The emergence of deep learning has enabled deep neural networks that fuse the two steps into an end-to-end architecture learned from data. These methods directly output global boundary maps in nonparametric ways [3, 2, 7, 8, 5, 25, 26] or parametric ways [27, 28], and they typically outperform traditional, non-learning-based edge detection methods. A recent work, Boundary Attention, also combines the FoJ representation with deep neural networks [1]. It targets complicated, fine boundary structures and outputs richer boundary information, such as edge-aware distance maps, from noisy images. Compared with Boundary Attention, CT-Bound focuses on boundary detection and achieves a higher boundary detection accuracy on benchmark datasets while running about three times faster. Nonetheless, we suggest readers also read [1] for a more comprehensive understanding of this field.
III Methods
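We first briefly describe the FoJ representation [9] that we adopt in this work. Given the $i$-th image patch with dimension $h \times w$ and $k$ color channels, the FoJ models its boundary structure using a parameter set $\Phi^{(i)} = \{\mathbf{u}^{(i)}, \boldsymbol{\theta}^{(i)}, \mathbf{c}^{(i)}\}$, where $\mathbf{u}^{(i)} = (u_x^{(i)}, u_y^{(i)})$ indicates the center of the vertex, $\boldsymbol{\theta}^{(i)} = (\theta_1^{(i)}, \dots, \theta_l^{(i)})$ represents the angles of the edges, and $\mathbf{c}^{(i)} = (\mathbf{c}_1^{(i)}, \dots, \mathbf{c}_l^{(i)})$ are the colors of the regions between every pair of neighboring edges. The number of edges $l$ is a hyperparameter that needs to be predetermined. See Fig. 1 for an exemplar illustration of FoJ. As shown in Verbin and Zickler, FoJ can represent a variety of local boundary structures, including edges, corners, and junctions [9]. Given an FoJ representation $\Phi^{(i)}$, the corresponding boundary map of the patch can be plotted via:
$$B^{(i)}(\mathbf{x}) = \pi \eta \, \delta_\eta\!\left(d^{(i)}(\mathbf{x})\right) \tag{1}$$
where $d^{(i)}(\mathbf{x})$ is the distance from the pixel $\mathbf{x}$ to the nearest edge and $\delta_\eta$ is the derivative of the Heaviside function in [29]:
$$\delta_\eta(d) = \frac{1}{\pi} \cdot \frac{\eta}{\eta^2 + d^2} \tag{2}$$
where $\eta$ is a smoothing parameter that we keep fixed throughout our experiments. The color map can be visualized using:
$$C^{(i)}(\mathbf{x}) = \sum_{j=1}^{l} \mathbb{1}_j^{(i)}(\mathbf{x}) \, \mathbf{c}_j^{(i)} \tag{3}$$
where $\mathbb{1}_j^{(i)}(\mathbf{x}) = 1$ when the pixel $\mathbf{x}$ is within the wedge between edges $j$ and $j+1$ in the patch $i$:
$$\mathbf{x} \in \Omega_j^{(i)} = \left\{ \mathbf{x} : \theta_j^{(i)} \le \angle\!\left(\mathbf{x} - \mathbf{u}^{(i)}\right) < \theta_{j+1}^{(i)} \right\}, \qquad \theta_{l+1}^{(i)} = \theta_1^{(i)} + 2\pi \tag{4}$$
and $\mathbb{1}_j^{(i)}(\mathbf{x}) = 0$ otherwise.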
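As a rough illustration of how a per-patch boundary map and color map could be rendered from FoJ parameters, the following NumPy sketch evaluates the arctangent-regularized delta $\eta / (\pi(\eta^2 + d^2))$ from [29] at the distance to the nearest edge and assigns each pixel the color of its wedge. It is not the released implementation: the function name `render_foj_patch`, the pixel-coordinate convention (x to the right, y downward), the scaling of the boundary response so that it peaks at 1, the treatment of each edge as a ray leaving the vertex, and the default values of `size` and `eta` are all assumptions made for the example.

```python
import numpy as np

def render_foj_patch(vertex, angles, colors, size=21, eta=1.0):
    """Render (boundary, color) maps for one patch from FoJ-style parameters.

    vertex : (ux, uy) in pixel coordinates (hypothetical convention)
    angles : edge angles in radians, sorted ascending, each in [0, 2*pi)
    colors : (len(angles), k) array, one color per wedge
    """
    angles = np.asarray(angles, dtype=float)
    colors = np.asarray(colors, dtype=float)
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    dx, dy = xs - vertex[0], ys - vertex[1]

    # Distance from every pixel to every edge, each edge being a ray from the vertex.
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=-1)        # (l, 2)
    along = dx[..., None] * dirs[:, 0] + dy[..., None] * dirs[:, 1]   # projection on the ray
    perp = np.abs(dx[..., None] * dirs[:, 1] - dy[..., None] * dirs[:, 0])
    radial = np.hypot(dx, dy)[..., None]
    dist = np.where(along >= 0, perp, radial).min(axis=-1)            # nearest edge

    # Smoothed delta, rescaled by pi * eta so that the response is 1 on an edge.
    delta = eta / (np.pi * (eta**2 + dist**2))
    boundary = np.pi * eta * delta

    # Wedge j covers polar angles in [angles[j], angles[j+1]), wrapping at 2*pi.
    phi = np.mod(np.arctan2(dy, dx), 2 * np.pi)
    wedge = (np.searchsorted(angles, phi, side="right") - 1) % len(angles)
    return boundary, colors[wedge]
```

For example, `render_foj_patch((10, 10), [0, 2*np.pi/3, 4*np.pi/3], np.eye(3))` renders a three-edge junction centered in a 21 by 21 patch, with the boundary response close to 1 along the three rays and decaying away from them.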
III-A Network architecture
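The network architecture of the proposed method is visualized in Fig. 2. Given a noisy image, CT-Bound first divides the image into overlapping patches. The initialization stage feeds each image patch into a CNN to generate the initial vertex location $\mathbf{u}^{(i)}$ and edge angles $\boldsymbol{\theta}^{(i)}$ of the FoJ representation of patch $i$. Then, the method determines the color parameters analytically by averaging the colors of the pixels in each wedge of the patch:
$$\mathbf{c}_j^{(i)} = \frac{1}{\left|\Omega_j^{(i)}\right|} \sum_{\mathbf{x} \in \Omega_j^{(i)}} I^{(i)}(\mathbf{x}) \tag{5}$$
where $\Omega_j^{(i)}$ indicates the set of pixels in the wedge between edges $j$ and $j+1$ in the patch $i$, as defined in (4), and $I^{(i)}(\mathbf{x})$ is the color of pixel $\mathbf{x}$ in the noisy patch.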
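The wedge-wise color averaging can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `wedge_mean_colors`, the pixel-coordinate convention, the requirement that the angles are sorted in $[0, 2\pi)$, and the choice to leave an empty wedge at zero are ours, not details given in the text.

```python
import numpy as np

def wedge_mean_colors(patch, vertex, angles):
    """Average the pixel colors of `patch` inside each wedge around `vertex`."""
    patch = np.asarray(patch, dtype=float)        # (h, w, k) noisy patch
    angles = np.asarray(angles, dtype=float)      # sorted, each in [0, 2*pi)
    h, w = patch.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]

    # Polar angle of every pixel around the vertex, mapped to [0, 2*pi).
    phi = np.mod(np.arctan2(ys - vertex[1], xs - vertex[0]), 2 * np.pi)
    # Wedge j covers [angles[j], angles[j+1]); the last wedge wraps around.
    wedge = (np.searchsorted(angles, phi, side="right") - 1) % len(angles)

    colors = np.zeros((len(angles), patch.shape[2]))
    for j in range(len(angles)):
        mask = wedge == j
        if mask.any():                            # an empty wedge keeps a zero color
            colors[j] = patch[mask].mean(axis=0)
    return colors
```

Because the colors follow in closed form from the vertex and the angles, only the geometric parameters need to be predicted by the network in this scheme.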
The refinement stage takes in the initial FoJ representations of all patches simultaneously. It first converts each initial FoJ representation into a feature vector, and applies positional encoding by adding a positional vector to the feature vector to incorporate the positional information of each image patch. The 2D positional encoding vector follows the design of Wang and Liu [30]:
where the dimension of each feature vector is an even number. All positionally encoded feature vectors are fed into a transformer encoder consisting of a series of multi-head attention layers to refine the boundary consistency among patches globally and adjust unnatural boundary estimations. It is the only block in the framework that globally shares the per-patch FoJ information. Finally, the framework outputs the refined vertex locations and edge angles of all patches, and calculates the refined color parameters using (5). We list the network hyperparameters used in our experiments in Tab. I. As the transformer does not operate in the image domain, the dimension of the input vector and the number of layers are much smaller than in the classic Vision Transformer [15].
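For readers unfamiliar with 2D sinusoidal positional encodings, the sketch below shows one common construction in which half of the channels encode the column index and the other half encode the row index, each with interleaved sine and cosine terms. The exact formula used in CT-Bound is not reproduced here, so the channel layout, the base of 10000, the divisibility-by-4 requirement, and the function name `positional_encoding_2d` should all be read as assumptions.

```python
import numpy as np

def positional_encoding_2d(rows, cols, dim, base=10000.0):
    """Sinusoidal 2D positional encoding with shape (rows, cols, dim)."""
    if dim % 4 != 0:
        raise ValueError("this sketch needs dim divisible by 4")
    quarter = dim // 4
    freq = base ** (-np.arange(quarter) / quarter)   # (dim/4,) frequencies
    x = np.arange(cols)[:, None] * freq              # (cols, dim/4)
    y = np.arange(rows)[:, None] * freq              # (rows, dim/4)

    pe = np.zeros((rows, cols, dim))
    half = dim // 2
    pe[:, :, 0:half:2] = np.sin(x)[None, :, :]       # column index, first half
    pe[:, :, 1:half:2] = np.cos(x)[None, :, :]
    pe[:, :, half::2] = np.sin(y)[:, None, :]        # row index, second half
    pe[:, :, half + 1::2] = np.cos(y)[:, None, :]
    return pe
```

With the 128-dimensional feature vectors listed in Tab. I, `positional_encoding_2d(n_rows, n_cols, 128)` would produce one vector per patch position that can be added to the corresponding feature vector.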
Table I: Network hyperparameters used in our experiments.
Convolutional Neural Network
Layer | Specification | Output
---|---|---
Conv2d | 5×5 kernel, 4 stride, 2 pad | (21, 21, 96)
MaxPool2d | 3×3 kernel, 2 stride, 0 pad | (10, 10, 96)
Conv2d | 5×5 kernel, 1 stride, 2 pad | (10, 10, 256)
MaxPool2d | 2×2 kernel, 2 stride, 0 pad | (5, 5, 256)
Conv2d | 3×3 kernel, 1 stride, 1 pad | (5, 5, 384)
Conv2d | 3×3 kernel, 1 stride, 1 pad | (5, 5, 384)
Conv2d | 3×3 kernel, 1 stride, 1 pad | (5, 5, 256)
MaxPool2d | 3×3 kernel, 2 stride, 0 pad | (2, 2, 256)
FC | - | 4096
FC | - | 1024
FC | - | 5
Transformer Encoder
Specification | Parameter
---|---
Dimension of each input vector | 128
Number of layers | 8
Number of heads in each layer | 8
Dimension of the feed-forward layer | 256
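The spatial sizes in Tab. I can be checked with the standard output-size formula for convolution and pooling layers, $\lfloor (n + 2p - k)/s \rfloor + 1$. The short script below starts from the 21 by 21 output of the first convolution as listed in the table (the input patch size is not assumed) and walks through the remaining layers; the helper name `out_size` is ours.

```python
def out_size(n, kernel, stride, pad):
    """Spatial output size of a convolution or pooling layer (floor mode)."""
    return (n + 2 * pad - kernel) // stride + 1

# (name, kernel, stride, pad) for the layers after the first Conv2d in Tab. I.
layers = [
    ("MaxPool2d", 3, 2, 0),
    ("Conv2d", 5, 1, 2),
    ("MaxPool2d", 2, 2, 0),
    ("Conv2d", 3, 1, 1),
    ("Conv2d", 3, 1, 1),
    ("Conv2d", 3, 1, 1),
    ("MaxPool2d", 3, 2, 0),
]

size = 21  # spatial size after the first Conv2d, as listed in Tab. I
sizes = []
for name, k, s, p in layers:
    size = out_size(size, k, s, p)
    sizes.append(size)
    print(f"{name:10s} -> {size} x {size}")

flattened = size * size * 256  # features entering the first fully connected layer
```

The sizes reproduce the table (10, 10, 5, 5, 5, 5, 2), so the first fully connected layer receives 2 × 2 × 256 = 1024 features. The final 5 outputs would be consistent with two vertex coordinates plus three edge angles, although the text does not state this explicitly.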
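From the refined FoJ parameters, CT-Bound generates the per-patch boundary maps $B^{(i)}$ and color maps according to (1) and (3), and computes the global boundary map by averaging the per-patch boundary maps [9]:
$$B(\mathbf{x}) = \frac{1}{|N_{\mathbf{x}}|} \sum_{i \in N_{\mathbf{x}}} B^{(i)}(\mathbf{x}) \tag{6}$$
and the global color map via a specific smoothing operation over the per-patch color maps:
$$C(\mathbf{x}) = \frac{1}{|N_{\mathbf{x}}|} \sum_{i \in N_{\mathbf{x}}} \sum_{j=1}^{l} \mathbb{1}_j^{(i)}(\mathbf{x}) \, \hat{\mathbf{c}}_j^{(i)} \tag{7}$$
where $N_{\mathbf{x}}$ is the set of the patch indices that contain $\mathbf{x}$, $\mathbb{1}_j^{(i)}(\mathbf{x})$ is a binary indicator that is 1 if pixel $\mathbf{x}$ belongs to the wedge $j$ of patch $i$ and 0 otherwise, and $\hat{\mathbf{c}}_j^{(i)}$ is the refined color of wedge $j$.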
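Averaging overlapping per-patch maps into a global map amounts to accumulating each patch into its image location and dividing by the number of patches covering each pixel. The sketch below illustrates this for a single-channel map; the regular patch grid, the `stride` parameter, and the function name `aggregate_patches` are assumptions for the example, since the patch layout is not specified here.

```python
import numpy as np

def aggregate_patches(patch_maps, height, width, stride=1):
    """Average overlapping per-patch maps into one global map.

    patch_maps : (n_rows, n_cols, r, r) array; patch (a, b) is assumed to cover
                 rows a*stride .. a*stride + r - 1 and the matching columns.
    """
    n_rows, n_cols, r, _ = patch_maps.shape
    total = np.zeros((height, width))
    count = np.zeros((height, width))
    for a in range(n_rows):
        for b in range(n_cols):
            top, left = a * stride, b * stride
            total[top:top + r, left:left + r] += patch_maps[a, b]
            count[top:top + r, left:left + r] += 1
    return total / np.maximum(count, 1)   # pixels covered by no patch stay 0
```

The same accumulation applies per channel to the per-patch color maps, which is why pixels covered by many patches receive a smoothed color.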
III-B Loss functions
We develop a multi-stage training scheme to optimize the parameters of CT-Bound. First, we train the initialization stage using the patch reconstruction loss:
(8) |
where the expectation is taken over all patches in the training set, and the two compared terms are the per-patch color maps reconstructed using the true and estimated FoJ parameters, respectively. We observe that the visual quality of the FoJ estimation is higher when using the loss in (8) for training than when directly supervising the FoJ parameters. Furthermore, because the CNN in the initialization stage has a small receptive field, we can train it on synthetic image patches of basic shapes, and we observe that the trained model generalizes to real-world image patches without further fine-tuning.
When optimizing the parameters of the refinement stage, we use a fixed, pre-trained initialization stage to generate the inputs. We adopt a two-step training process for the refinement stage, which we observe leads to more stable and faster convergence. In the first step, a mean squared error loss function is adopted to supervise the estimated FoJ parameters directly:
(9) |
In the second step, we use a comprehensive image reconstruction loss adapted from Verbin and Zickler [9]:
(10) |
where the expectation is taken over all images in the dataset, and the three terms are the patch, boundary, and color loss terms, respectively.
In Verbin and Zickler, the loss in (10) was solved in an alternating, two-step fashion to refine the FoJ representation iteratively [9]. We evaluate the loss in a single step and show it can successfully fine-tune the feed-forward transformer encoder to improve the FoJ representation.
IV Experimental Results
IV-A Data processing
For the training of the initialization stage, we use randomly sampled patches from FoJ synthetic datasets [9] that only contain images of basic shapes such as squares. We select image patches for training and for testing. To simulate image noise, we apply a Poisson-Gaussian process to each image patch [31]:
$$\tilde{I} = \mathrm{Poisson}(\alpha I) + \mathcal{N}(0, \sigma^2) \tag{11}$$
where $\tilde{I}$ and $I$ are the noisy and normalized clean image patches, $\alpha$ is the photon level parameter that controls the noise of the image, and $\sigma$ is the standard deviation of read noise (set as in [32]). For the refinement stage, we use images from MS COCO [33] for training and testing. The training and testing sets contain and randomly selected, non-overlapping images, respectively. Each image is cropped at the center to the size of and is applied with the same noise as described in (11). In our experiment, we randomly set the photon level within the range to generate images with a variety of noise levels.
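A Poisson-Gaussian noise simulation of this kind can be sketched as follows, assuming the common form in which the clean image, normalized to [0, 1], is scaled by the photon level before Poisson sampling and zero-mean Gaussian read noise is added afterwards. The function name `add_poisson_gaussian_noise` and the values of `alpha` and `sigma` in the usage lines are hypothetical and chosen only for illustration.

```python
import numpy as np

def add_poisson_gaussian_noise(clean, alpha, sigma, rng=None):
    """Simulate shot noise at photon level `alpha` plus Gaussian read noise.

    clean : array with values in [0, 1] (normalized clean image or patch)
    alpha : photon level; smaller values give noisier images
    sigma : standard deviation of the read noise
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean, dtype=float)
    shot = rng.poisson(alpha * clean).astype(float)   # photon counts
    read = rng.normal(0.0, sigma, size=clean.shape)   # additive read noise
    return shot + read

# Usage with hypothetical settings: a flat gray patch at photon level 4.
rng = np.random.default_rng(0)
noisy = add_poisson_gaussian_noise(np.full((200, 200), 0.5), alpha=4, sigma=0.1, rng=rng)
```

At a pixel with clean intensity 0.5 and photon level 4, the expected count is only 2 photons, so the shot-noise standard deviation (about 1.4) is comparable to the signal itself, which is the regime the evaluation targets.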
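Table II: ODS F1-score of the boundary maps from the initialization stage and after the refinement stage.
Photon level | BSDS500 [12], initialization | BSDS500 [12], refinement | NYUDv2 [34], initialization | NYUDv2 [34], refinement
---|---|---|---|---
2 | 0.482 | 0.541 | 0.479 | 0.552
4 | 0.509 | 0.627 | 0.522 | 0.633
6 | 0.518 | 0.640 | 0.538 | 0.646
8 | 0.524 | 0.633 | 0.546 | 0.647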
We evaluate CT-Bound on the testing sets of Berkeley Segmentation Data Set 500 (BSDS500) [12] and NYU Depth Dataset V2 (NYUDv2) [34]. We crop images to size and add noise as above. BSDS500 has 200 testing images. For NYUDv2, 200 images are randomly selected from the testing split adopted in [35]. Using different datasets for training and evaluation demonstrates the generalizability of our model.
IV-B Implementation details
We use in our implementation. All optimizations in this work use the Adam optimizer [36]. The initialization stage is trained with an initial learning rate of and a decay of every epochs. The batch size is , and the total number of training epochs is . We use a two-step scheme to train the refinement stage as described in Sec. III-B. Both steps use a batch size . The first step uses (9) as the objective function and has epochs. a learning rate . The second step switches to (10) as its loss function and runs epochs. The learning rate for the second step is updated with a triangular cycle between and . The training and testing are performed on a machine with an NVIDIA GeForce RTX A5000 graphics card and 24 GB memory.
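A triangular cyclic learning-rate schedule ramps linearly from a lower bound to an upper bound and back again within each cycle. The sketch below shows the rule in plain Python; the bounds and the cycle length in the usage lines are hypothetical placeholders, not the values used in our training, and the function name `triangular_lr` is ours.

```python
def triangular_lr(step, base_lr, max_lr, half_cycle):
    """Learning rate at `step` for a triangular cycle of length 2 * half_cycle."""
    position = step % (2 * half_cycle)
    # Distance from the peak of the triangle, normalized to [0, 1].
    distance = abs(position - half_cycle) / half_cycle
    return base_lr + (max_lr - base_lr) * (1.0 - distance)

# Usage with hypothetical bounds: one full cycle of 20 steps.
schedule = [triangular_lr(s, base_lr=1e-5, max_lr=1e-4, half_cycle=10) for s in range(21)]
```

In PyTorch, the same behavior is available through `torch.optim.lr_scheduler.CyclicLR` with `mode="triangular"`.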
IV-C Ablation study
Table III: ODS F1-score on BSDS500 [12] and NYUDv2 [34] at photon levels 2, 4, 6, and 8, and inference speed in frames per second (FPS). FPS backends: § OpenCV-CPU, † PyTorch-GPU, ‡ JAX-GPU.
Model | Publication | BSDS500, level 2 | BSDS500, level 4 | BSDS500, level 6 | BSDS500, level 8 | NYUDv2, level 2 | NYUDv2, level 4 | NYUDv2, level 6 | NYUDv2, level 8 | FPS
---|---|---|---|---|---|---|---|---|---|---
Canny [6] | PAMI’86 | 0.493 | 0.489 | 0.383 | 0.495 | 0.476 | 0.467 | 0.325 | 0.482 | 625§
HED [5] | ICCV’15 | 0.327 | 0.388 | 0.456 | 0.520 | 0.307 | 0.363 | 0.434 | 0.484 | 18.8§
FoJ [9] | ICCV’21 | 0.509 | 0.564 | 0.597 | 0.611 | 0.529 | 0.576 | 0.597 | 0.613 | 1/68†
PiDiNet [8] | ICCV’21 | 0.316 | 0.356 | 0.447 | 0.480 | 0.274 | 0.311 | 0.409 | 0.456 | 125†
EDTER [7] | CVPR’22 | 0.484 | 0.509 | 0.545 | 0.594 | 0.488 | 0.478 | 0.534 | 0.581 | 15.6†
Restormer [4] + Canny [6] | CVPR’22 + PAMI’86 | 0.490 | 0.521 | 0.518 | 0.516 | 0.511 | 0.533 | 0.503 | 0.499 | 9.6†
Restormer [4] + HED [5] | CVPR’22 + ICCV’15 | 0.459 | 0.628 | 0.674 | 0.707 | 0.474 | 0.625 | 0.647 | 0.663 | 6.4†
UAED [3] | CVPR’23 | 0.473 | 0.553 | 0.618 | 0.665 | 0.452 | 0.557 | 0.620 | 0.652 | 28.7†
PEdger [2] | ACMMM’23 | 0.525 | 0.611 | 0.657 | 0.684 | 0.509 | 0.606 | 0.632 | 0.650 | 14.4†
Boundary Attention [1] | arXiv’24 | 0.534 | 0.591 | 0.607 | 0.615 | 0.572 | 0.609 | 0.624 | 0.625 | 4.3‡
Ours | - | 0.541 | 0.627 | 0.640 | 0.633 | 0.552 | 0.633 | 0.646 | 0.647 | 11.4†, 15.2‡
We analyze the benefit offered by the refinement stage of CT-Bound. As shown in Fig. 3, the refinement stage attenuates noisy and inconsistent boundary estimations and strengthens real boundaries compared to the boundary map from the initialization stage. It also makes the color map appear smoother and sharper at color boundaries. The quantitative analysis is shown in Tab. II. It draws the same conclusion: the refinement stage increases the ODS F1-score of the boundary map compared to the initialization stage.
IV-D Analysis on synthetic and real images
We compare the proposed method with the iterative FoJ solver [9], the traditional edge detector Canny [6], and other learning-based models, including Boundary Attention [1], PEdger [2], UAED [3], Restormer [4] + HED [5], Restormer [4] + Canny [6], EDTER [7], PiDiNet [8], and HED [5]. Tab. III shows the quantitative comparison of these methods. Note that our model is trained using images from the MS COCO dataset [33] with random photon levels and evaluated using images from the BSDS500 [12] and NYUDv2 [34] datasets on a specific photon level . The proposed approach achieves the highest or near-highest ODS F1-score when the noise level is very high, i.e. . Fig. 4 shows sample boundary maps estimated from noisy images synthesized from the BSDS500 [12] and NYUDv2 [34] datasets. Both results indicate the robustness and generalizability of the proposed method to different image datasets and noise levels.
Fig. 5 shows the boundary maps and color maps estimated from a real image captured by an iPhone 13 Mini camera with a high shutter speed. The boundary map and color map generated by the proposed method have high visual quality without any fine-tuning on real images. We have also uploaded a video of the boundary maps and color maps that CT-Bound generates from a real captured video clip to the URL listed in Sec. I.
V Conclusion and Limitation
In this paper, we propose a two-stage boundary detector, CT-Bound, which is a hybrid neural network architecture aiming to achieve robust and accurate boundary detection on extremely noisy images in a single shot. Compared to a variety of models, our method demonstrates the highest or near-highest boundary detection accuracy on benchmark datasets, producing visually clean and crisp boundaries. A limitation we observe is that, when processing videos, the detected boundaries can change abruptly across frames. This is because CT-Bound processes each frame independently. The problem can be resolved by introducing temporal consistency into the model.
References
- [1] Mia Gaia Polansky, Charles Herrmann, Junhwa Hur, Deqing Sun, Dor Verbin, and Todd Zickler, “Boundary attention: Learning to find faint boundaries at any resolution,” arXiv preprint arXiv:2401.00935, 2024.
- [2] Yuanbin Fu and Xiaojie Guo, “Practical edge detection via robust collaborative learning,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 2526–2534.
- [3] Caixia Zhou, Yaping Huang, Mengyang Pu, Qingji Guan, Li Huang, and Haibin Ling, “The treasure beneath multiple annotations: An uncertainty-aware edge detector,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15507–15517.
- [4] Syed Waqas Zamir, Aditya Arora, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang, “Restormer: Efficient transformer for high-resolution image restoration,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 5728–5739.
- [5] Saining Xie and Zhuowen Tu, “Holistically-nested edge detection,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1395–1403.
- [6] John Canny, “A computational approach to edge detection,” IEEE Transactions on pattern analysis and machine intelligence, no. 6, pp. 679–698, 1986.
- [7] Mengyang Pu, Yaping Huang, Yuming Liu, Qingji Guan, and Haibin Ling, “Edter: Edge detection with transformer,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1402–1412.
- [8] Zhuo Su, Wenzhe Liu, Zitong Yu, Dewen Hu, Qing Liao, Qi Tian, Matti Pietikäinen, and Li Liu, “Pixel difference networks for efficient edge detection,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 5117–5127.
- [9] Dor Verbin and Todd Zickler, “Field of junctions: Extracting boundary structure at low snr,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 6869–6878.
- [10] Rui Sun, Tao Lei, Qi Chen, Zexuan Wang, Xiaogang Du, Weiqiang Zhao, and Asoke K Nandi, “Survey of image edge detection,” Frontiers in Signal Processing, vol. 2, pp. 826967, 2022.
- [11] FG Irwin et al., “An isotropic 3x3 image gradient operator,” Presentation at Stanford AI Project, vol. 1968, pp. 3, 2014.
- [12] Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik, “Contour detection and hierarchical image segmentation,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 5, pp. 898–916, 2010.
- [13] Jitendra Malik, Serge Belongie, Thomas Leung, and Jianbo Shi, “Contour and texture analysis for image segmentation,” International journal of computer vision, vol. 43, pp. 7–27, 2001.
- [14] L Roberts, “Machine perception of 3-d solids, optical and electro-optical information processing,” 1965.
- [15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [16] Xin-Yi Gong, Hu Su, De Xu, Zheng-Tao Zhang, Fei Shen, and Hua-Bin Yang, “An overview of contour detection approaches,” International Journal of Automation and Computing, vol. 15, pp. 656–672, 2018.
- [17] Xin Wang, “Laplacian operator-based edge detectors,” IEEE transactions on pattern analysis and machine intelligence, vol. 29, no. 5, pp. 886–890, 2007.
- [18] Nati Ofir, Meirav Galun, Sharon Alpert, Achi Brandt, Boaz Nadler, and Ronen Basri, “On detection of faint edges in noisy images,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 4, pp. 894–908, 2019.
- [19] William T Freeman, Edward H Adelson, et al., “The design and use of steerable filters,” IEEE Transactions on Pattern analysis and machine intelligence, vol. 13, no. 9, pp. 891–906, 1991.
- [20] Pietro Perona, Jitendra Malik, et al., “Detecting and localizing edges composed of steps, peaks and roofs,” 1991.
- [21] Y. Lu and R.C. Jain, “Reasoning about edges in scale space,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 4, pp. 450–468, 1992.
- [22] Claudia Nieuwenhuis, Eno Toeppe, Lena Gorelick, Olga Veksler, and Yuri Boykov, “Efficient squared curvature,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4098–4105.
- [23] Qiuxiang Zhong, Yutong Li, Yijie Yang, and Yuping Duan, “Minimizing discrete total curvature for image processing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9474–9482.
- [24] Nati Ofir, Meirav Galun, Boaz Nadler, and Ronen Basri, “Fast detection of curved edges at low snr,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 213–221.
- [25] Xavier Soria Poma, Edgar Riba, and Angel Sappa, “Dense extreme inception network: Towards a robust cnn model for edge detection,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2020, pp. 1923–1932.
- [26] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
- [27] Kun Huang, Yifan Wang, Zihan Zhou, Tianjiao Ding, Shenghua Gao, and Yi Ma, “Learning to parse wireframes in images of man-made environments,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 626–635.
- [28] Nan Xue, Song Bai, Fudong Wang, Gui-Song Xia, Tianfu Wu, and Liangpei Zhang, “Learning attraction field representation for robust line segment detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1595–1603.
- [29] Tony F Chan and Luminita A Vese, “Active contours without edges,” IEEE Transactions on image processing, vol. 10, no. 2, pp. 266–277, 2001.
- [30] Zelun Wang and Jyh-Charn Liu, “Translating math formula images to latex sequences using deep neural networks with sequence-level training,” International Journal on Document Analysis and Recognition (IJDAR), vol. 24, no. 1-2, pp. 63–75, 2021.
- [31] Qiaoqiao Ding, Yong Long, Xiaoqun Zhang, and Jeffrey A Fessler, “Modeling mixed poisson-gaussian noise in statistical image reconstruction for x-ray ct,” Arbor, vol. 1001, pp. 48109, 2016.
- [32] Stanley H Chan, “What does a one-bit quanta image sensor offer?,” IEEE Transactions on Computational Imaging, vol. 8, pp. 770–783, 2022.
- [33] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
- [34] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus, “Indoor segmentation and support inference from rgbd images,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12. Springer, 2012, pp. 746–760.
- [35] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik, “Perceptual organization and recognition of indoor scenes from rgb-d images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 564–571.
- [36] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
Supplemental Materials for CT-Bound: Robust Boundary Detection From Noisy Images Via Hybrid Convolution and Transformer Neural Networks
In this supplementary document, we first show some samples of our training data in Sec. S1. Then we report a quantitative comparison for color maps on BSDS500 [12] and NYUDv2 [34] datasets in Sec. S2. Finally we present more qualitative results from BSDS500 [12] and NYUDv2 [34] datasets in Sec. S3.
S1 Training Data
We use different datasets for the initialization and refinement stages. For the former stage, our model needs patches as inputs only. We generate noisy patches based on those randomly sampled from FoJ synthetic datasets [9], which contain basic shapes in grayscale. We assign RGB colors randomly and add noise according to (11). For the latter stage, we use noisy images generated from MS COCO [33] with the same noise model. The ground truth boundary maps are generated by running FoJ [9] with the ground truth color maps. Some training pair samples are shown in Fig. S1.
S2 Quantitative Results of Color Maps
Since different photon levels cause different distribution parameters of pixel values, we normalize the pixel values before calculating the metrics for color maps. Specifically, a color map is normalized through:
(S1) |
where and is from (11). Then the structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), and mean squared error (MSE) are calculated between and . The quantitative comparison is shown in Tab. S1.
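For reference, MSE and PSNR between two normalized color maps can be computed as in the sketch below, which assumes pixel values in [0, 1] so that the peak value is 1. The function names are ours, and SSIM is omitted because it is usually taken from an image-processing library rather than written by hand.

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two arrays of the same shape."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((x - y) ** 2))

def psnr(x, y, peak=1.0):
    """Peak signal-to-noise ratio in dB for signals with maximum value `peak`."""
    err = mse(x, y)
    return float("inf") if err == 0 else float(10.0 * np.log10(peak**2 / err))

# Usage: a constant error of 0.1 gives an MSE of 0.01, i.e. a PSNR of 20 dB.
reference = np.zeros((8, 8, 3))
estimate = np.full((8, 8, 3), 0.1)
error, quality = mse(reference, estimate), psnr(reference, estimate)
```

Under this convention an MSE of $10^{-2}$ corresponds to 20 dB, and every tenfold reduction in MSE adds 10 dB.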
Table S1: Quantitative comparison of color maps on the BSDS500 [12] and NYUDv2 [34] datasets.
Photon level | Model | BSDS500 SSIM | BSDS500 PSNR (dB) | BSDS500 MSE (×10⁻²) | NYUDv2 SSIM | NYUDv2 PSNR (dB) | NYUDv2 MSE (×10⁻²)
---|---|---|---|---|---|---|---
2 | FoJ [9] | 0.338 | 10.712 | 8.626 | 0.433 | 10.657 | 8.954
2 | Restormer [4] | 0.152 | 8.492 | 14.624 | 0.157 | 8.567 | 14.312
2 | Boundary Attention [1] | 0.301 | 10.952 | 8.982 | 0.355 | 10.791 | 9.284
2 | Ours | 0.332 | 10.724 | 8.606 | 0.467 | 10.709 | 8.870
4 | FoJ [9] | 0.441 | 17.212 | 2.012 | 0.572 | 17.619 | 1.913
4 | Restormer [4] | 0.268 | 14.918 | 3.431 | 0.222 | 14.920 | 3.478
4 | Boundary Attention [1] | 0.452 | 17.785 | 2.031 | 0.560 | 18.103 | 1.926
4 | Ours | 0.429 | 17.198 | 2.024 | 0.587 | 17.740 | 1.885
6 | FoJ [9] | 0.484 | 19.780 | 1.176 | 0.632 | 20.799 | 0.951
6 | Restormer [4] | 0.357 | 17.496 | 2.042 | 0.289 | 17.800 | 1.903
6 | Boundary Attention [1] | 0.510 | 20.542 | 1.130 | 0.641 | 21.633 | 0.908
6 | Ours | 0.468 | 19.688 | 1.209 | 0.637 | 20.918 | 0.947
8 | FoJ [9] | 0.507 | 20.900 | 0.951 | 0.667 | 22.427 | 0.669
8 | Restormer [4] | 0.430 | 19.258 | 1.406 | 0.353 | 19.706 | 1.289
8 | Boundary Attention [1] | 0.541 | 21.768 | 0.872 | 0.684 | 23.500 | 0.604
8 | Ours | 0.488 | 20.726 | 0.996 | 0.666 | 22.473 | 0.679