GSO-YOLO: Global Stability Optimization YOLO for Construction Site Detection

Yuming Zhang, Dongzhi Guan, Shouxin Zhang, Junhao Su, Yunzhi Han, Jiabin Liu
School of Civil Engineering, Southeast University, Nanjing 211189, China

Abstract

Safety issues at construction sites have long plagued the industry, posing risks to worker safety and causing economic damage due to potential hazards. With the advancement of artificial intelligence, particularly in the field of computer vision, the automation of safety monitoring on construction sites has emerged as a solution to this longstanding issue. Despite achieving impressive performance, advanced object detection methods like YOLOv8 still face challenges in handling the complex conditions found at construction sites. To address these challenges, this study presents the Global Stability Optimization YOLO (GSO-YOLO) model. The model integrates the Global Optimization Module (GOM) and the Steady Capture Module (SCM) to enhance global contextual information capture and detection stability. The proposed AIoU loss function, which combines CIoU and EIoU, improves detection accuracy and efficiency. Experiments on the SODA, MOCS, and CIS datasets show that GSO-YOLO outperforms existing methods, achieving state-of-the-art (SOTA) performance.

Keywords: Object Detection; Deep Learning; Construction Site Monitoring; YOLOv8

1 Introduction

Safety accidents at construction sites not only pose severe threats to the lives of construction workers but also result in significant economic losses [1]. According to statistics, from 2015 to 2019, the residential and municipal engineering industry in China experienced 3,275 construction safety accidents, resulting in a total of 3,840 deaths, with the number of accidents and fatalities increasing at an average annual rate of 14.98% and 12.64%, respectively [2]. This indicates an increasingly severe trend in construction site safety accidents. Therefore, effectively enhancing safety management and engineering supervision at construction sites to reduce casualties and improve construction efficiency has become a critical issue that the construction industry urgently needs to address [3].

Research indicates that more than half of construction safety accidents can be timely prevented through enhanced monitoring of construction sites [4]. However, the commonly used monitoring methods are primarily traditional techniques such as work sampling and personnel testing [5], which not only consume substantial human, material, and financial resources but also suffer from inefficiency and the inability to provide comprehensive, around-the-clock monitoring [6]. With the development of information and communication technology, some scholars have begun to explore the use of computer vision technology to achieve automated monitoring of construction sites [7, 8, 9]. For example, Fang et al. [10] develop an integrated hybrid learning method based on Faster R-CNN and Deep CNN to detect the use of safety harnesses by high-altitude construction workers. Wang et al. [11] integrate computer vision technology with multispectral data for the detection of construction machinery. However, the current computer vision-based automated monitoring technologies for construction sites still face the following three urgent problems:

Figure 1: Comparison between GSO-YOLO and YOLOv8.
  • Due to the complex backgrounds of construction sites, detection targets are frequently momentarily obstructed by obstacles such as construction vehicles. Consequently, the pixels captured by detection equipment often lack complete features. This results in a significant amount of noise, which severely interferes with the model’s ability to extract target features, thereby greatly reducing the accuracy of the detection model.

  • Since the detection targets are exposed to outdoor conditions, variations in lighting, weather, and resolution occur at different times. In low-light environments, the detection model struggles to identify effective feature information about the target objects, thereby reducing the reliability of the model’s recognition capability.

  • Given the large size of construction sites and the fact that detection equipment is often installed at wide angles far from the target areas, the detection model finds it challenging to extract feature information from distant targets. This increases the difficulty of accurate detection.

To address the aforementioned issues, this study proposes the Global Stability Optimization-YOLO (GSO-YOLO) model, as illustrated in Figure 2, which integrates the Global Optimization Module (GOM) and the Steady Capture Module (SCM) into the YOLOv8 model to handle the complexities of construction site environments. Additionally, this study has innovatively designed the AIoU loss function. Specifically, the GOM maximizes the capture of relevant information about detection targets within images by applying global attention weighting across input sequences and combining it with local information, thereby enhancing model performance and generalization capabilities. The SCM, on the other hand, smooths detection results and reduces noise interference by dynamically applying an exponentially weighted moving average to historical detection results, thereby improving the model’s stability and robustness. The combination of these two modules increases the receptive field of the network and enhances detection accuracy. Furthermore, by combining the advantages of the CIoU [44] and EIoU [45] loss functions, this study introduces a novel enhanced loss function, AIoU. With the deployment of GOM, SCM, and AIoU, GSO-YOLO effectively mitigates the interference of complex environments, momentary occlusions, and lighting variations in construction site object detection. Extensive experiments conducted on datasets such as SODA [46], MOCS [47], and CIS [48] validate the effectiveness of the proposed method. Overall, contributions are as follows:

Figure 2: The GSO-YOLO overall architecture.
  • This study presents a novel method called GSO-YOLO, which successfully addresses the shortcomings of existing YOLO series algorithms and effectively mitigates the interference caused by complex environments in construction site object detection. GSO-YOLO enables the network to focus on and learn critical information from the entire construction site globally, increasing the receptive field without being confined to local details. It also enhances the capture of historical information, providing stable tracking and recognition of targets.

  • The Global Optimization Module (GOM) and Steady Capture Module (SCM) designed in this study are general methods with excellent generalizability, making them applicable to various deep learning-based object detection frameworks.

  • GSO-YOLO outperforms existing mainstream methods on multiple public datasets, achieving state-of-the-art (SOTA) performance. This demonstrates its significant potential and reliability in practical applications, substantially advancing the technology of construction site object detection.

2 Related Works

Currently, leveraging computer vision technology to achieve automated monitoring of construction sites has garnered widespread attention in academia due to its potential to enhance project safety and productivity [12, 13, 14]. Researchers have primarily focused on various aspects of construction site automation, including progress monitoring [15, 16], quality management [17, 18, 19], activity recognition [20, 21], occupational health assessment [22, 23], worker safety inspections [24, 25, 26], automatic fire detection [27], collision risk prevention [28], equipment activity analysis [29], and smart glasses systems [30]. However, most of these studies rely on handcrafted features and sliding window approaches to traverse entire images, resulting in slow algorithm performance, insufficient accuracy, and a lack of model generalizability [31].

In contrast, deep learning convolutional neural networks (CNNs) have demonstrated superior performance in object detection tasks [32, 33]. CNN-based detectors utilize adaptive convolution for feature extraction, which enhances both the accuracy and speed of algorithms while making models more robust in object detection [4]. For example, some studies have applied CNNs for structural defect detection [34, 35], building energy consumption prediction [36], and scaffold deformation monitoring [37].

Although progress has been made, these studies face several challenges that limit their effectiveness in complex construction site environments. One significant issue is their reliance on low-level features to improve the detection of small objects, often at the expense of accurately detecting larger objects [38]. Extending feature levels using low-level features can enhance small object detection accuracy but can simultaneously reduce the accuracy for large object detection [39, 40].

In this regard, the YOLO series algorithms have demonstrated excellent performance. Notably, Hao et al. [5] propose a lightweight multi-object detection model based on YOLOv5s, achieving notable success in construction site monitoring. Similarly, Han et al. [38] develop the ghost-YOLOX model for nighttime construction site monitoring.

However, existing YOLO algorithms struggle with noise interference from momentary obstructions and lighting variations at construction sites, which diminishes their performance in these dynamic environments. Despite numerous improvements to the loss functions used in YOLO algorithms, there remains room for further enhancements to address these challenges comprehensively. These limitations underscore the necessity for more robust and versatile approaches in automated construction site monitoring. Thus, there is a need for YOLO series algorithms that balance detection accuracy across small, medium, and large objects.

3 Method

The stacking of deep convolutional layers consumes a significant amount of memory and computational resources, prompting the introduction of new modules incorporating attention mechanisms as an alternative approach. These modules not only enhance the learning of more distinctive features but can also be easily integrated into the backbone architecture of neural networks. Based on this, to better adapt the proposed network model to the requirements of construction site object detection, this study has incorporated attention mechanism modules into the backbone network of YOLOv8. This integration aims to improve the accuracy of object detection without merely increasing the network depth by adding more convolutional layers.

3.1 Global Optimization Module

In the Global Optimization Module (GOM), this study utilizes the Global Attention Mechanism (GAM) [42] to apply attention weighting across the global input sequence. This mechanism extracts global contextual information and integrates it with local information for computation. GAM considers attention weights along three dimensions simultaneously: channel, spatial width, and spatial height. Channel attention employs a 3D arrangement to maintain information across dimensions and uses a two-layer multi-layer perceptron (MLP) to amplify spatial correlations across channels. To focus on spatial information, the GAM uses two convolutional layers for spatial information fusion and eliminates the pooling process to preserve feature maps.

Assuming the input feature map is $F_1 \in \mathbb{R}^{C \times H \times W}$, the intermediate state is $F_2$, and the output feature map is $F_3$, they can be represented as follows:

$$F_2 = M_c(F_1) \otimes F_1 \tag{1}$$

$$F_3 = M_s(F_2) \otimes F_2 \tag{2}$$

Here, $M_c$ and $M_s$ represent the channel and spatial attention maps, respectively, and $\otimes$ denotes element-wise multiplication.
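As a rough illustration of Eqs. (1) and (2), the two-stage weighting can be sketched in NumPy. This is not the authors' implementation: the function names, the reduction ratio `r`, the tensor sizes, the random weights, and the use of 1x1 channel mixing in place of the convolutional layers are all assumptions made to keep the example short.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f1, w1, w2):
    """M_c(F1): a two-layer MLP applied across channels at every pixel."""
    x = np.transpose(f1, (1, 2, 0))        # (C, H, W) -> (H, W, C)
    hidden = np.maximum(x @ w1, 0.0)       # (H, W, C // r), ReLU
    m_c = sigmoid(hidden @ w2)             # (H, W, C), values in (0, 1)
    return np.transpose(m_c, (2, 0, 1))    # back to (C, H, W)

def spatial_attention(f2, k1, k2):
    """M_s(F2): two channel-mixing layers with no pooling.
    Assumption: 1x1 mixing stands in for the convolutional layers."""
    hidden = np.maximum(np.einsum("oc,chw->ohw", k1, f2), 0.0)
    return sigmoid(np.einsum("co,ohw->chw", k2, hidden))

def gam_like(f1, w1, w2, k1, k2):
    f2 = channel_attention(f1, w1, w2) * f1   # Eq. (1)
    f3 = spatial_attention(f2, k1, k2) * f2   # Eq. (2)
    return f3

rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 4                       # hypothetical sizes
f1 = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C, C // r))
w2 = rng.standard_normal((C // r, C))
k1 = rng.standard_normal((C // r, C))
k2 = rng.standard_normal((C, C // r))
f3 = gam_like(f1, w1, w2, k1, k2)
print(f3.shape)
```

Because both attention maps pass through a sigmoid, every entry of the output is a rescaled copy of the corresponding input entry, and the output keeps the input shape. That is what allows such a block to be dropped into a backbone without changing the surrounding layers.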

The attention parameters of the GAM are updated through gradient descent, represented as follows:

$$\theta_i = \theta_{i-1} - r \times \nabla_\theta L \tag{3}$$

Here, $\theta_i$ is the attention weight parameter after the $i$-th iteration, $\theta_{i-1}$ is the attention weight parameter from the previous iteration, $r$ is the learning rate, $L$ is the loss function, and $\nabla_\theta L$ is the gradient of the loss function with respect to the attention weight parameter, which can be obtained through backpropagation.
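The update in Eq. (3) can be checked on a toy problem. The sketch below assumes a scalar parameter and a hypothetical quadratic loss $L(\theta) = (\theta - 3)^2$; neither comes from the paper, they only make the gradient easy to write by hand.

```python
def gradient_step(theta, grad, lr):
    """One update of Eq. (3): theta_i = theta_{i-1} - r * grad."""
    return theta - lr * grad

def loss_grad(theta, target=3.0):
    # Gradient of the toy loss L(theta) = (theta - target)^2
    return 2.0 * (theta - target)

theta = 0.0
for _ in range(100):
    theta = gradient_step(theta, loss_grad(theta), lr=0.1)
print(theta)
```

With a learning rate of 0.1 the distance to the minimum shrinks by a factor of 0.8 per step, so the parameter settles very close to 3. In the real network the gradient comes from backpropagation rather than a closed-form expression.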

This study’s objective is to enable the entire network model to mitigate information loss and enhance global dimension interaction features as much as possible with the assistance of the Global Optimization Module (GOM). By considering spatial interactions and supplementing cross-dimensional information, the model can focus on the most relevant parts of the sequence, thereby improving performance and generalization capabilities. In construction sites, there are typically various complex scenes and objects, such as building structures, workers, and machinery. To effectively identify and understand these objects, GOM can assist the object detection network in global optimization. Specifically, GOM enables the network to globally focus on and learn the key information of the entire construction site, expanding the receptive field without being limited to local details. This allows for a more comprehensive understanding and identification of various objects at the construction site. This capability to integrate semantic information and structural features enables the object detection model to more accurately reflect the real conditions of the construction site, providing robust technical support for construction management and safety monitoring.

3.2 Steady Capture Module

In the Steady Capture Module (SCM), this study employs the Exponential Moving Average (EMA) [43]. EMA is an attention mechanism used to enhance the performance of deep learning models by dynamically adjusting weights to emphasize the influence of recent data, thereby adapting to changes in the input data. Specifically, each input feature is assigned a weight that determines its importance in the output computation. These weights are typically generated by a softmax function, with the input being a learned score vector.

During training, as weights are passed into the Exponential Moving Average (EMA), these score vectors are dynamically adjusted. A neural network model with EMA will dynamically update the attention weights based on the feedback from the loss function, allowing it to better capture the relevance of the input data. Moreover, the EMA attention mechanism introduces a decay rate to control the degree of forgetfulness regarding historical information. The smaller the decay rate, the higher the model’s dependency on historical information, and vice versa. This decay rate typically ranges between 0 and 1.

In EMA, the formula for calculating attention weights can be expressed as follows:

$$w_i(t) = \frac{e^{s_i(t)}}{\sum_{j=1}^{N} e^{s_j(t)}} \tag{4}$$

where $w_i(t)$ represents the attention weight of the $i$-th input at time $t$, $s_i(t)$ is the score of the $i$-th input calculated by the neural network, and $N$ is the number of inputs. During training, $s_i(t)$ is dynamically adjusted. The update rule is expressed as:

$$s_i(t) = (1-\alpha) \times s_i(t-1) + \alpha \times s_i'(t) \tag{5}$$

where $\alpha$ represents the decay rate, which controls the retention of historical information, and $s_i'(t)$ is the current score at time $t$.
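Equations (4) and (5) can be sketched with the Python standard library alone. The function names and the example scores are hypothetical; the code only mirrors the two formulas and says nothing about how the scores are produced inside the network.

```python
import math

def softmax(scores):
    """Eq. (4): turn scores into weights that sum to 1."""
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ema_scores(prev_scores, current_scores, alpha):
    """Eq. (5): blend the previous scores with the current ones."""
    return [(1.0 - alpha) * p + alpha * c
            for p, c in zip(prev_scores, current_scores)]

prev = [2.0, 1.0, 0.5]       # s_i(t-1), hypothetical values
current = [0.0, 1.0, 3.0]    # s_i'(t), hypothetical values
smoothed = ema_scores(prev, current, alpha=0.3)
weights = softmax(smoothed)
print(smoothed, weights)
```

With `alpha=0.3` the smoothed scores stay closer to the previous scores than to the current ones, so the first input keeps the largest weight even though its current score is the lowest.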

In the task of object detection at construction sites, the complexity and dynamism of the environment can lead to challenges such as temporary occlusions and varying lighting conditions, making it difficult for detection models to accurately identify objects. To address this, this study employs the SCM to help the model better capture historical information and achieve stable tracking and recognition of targets.

In the SCM, the use of the EMA significantly impacts both the model’s performance and its stability during training. On one hand, EMA controls the retention of historical information through the decay rate $\alpha$. Historical information refers to the feature data from previously processed images or image sequences. Construction sites may contain various background elements, such as buildings and machinery, which, although not targets, assist the model in understanding the positions and shapes of current targets. However, excessive reliance on historical information might lead to an inaccurate understanding of the current scene. The decay rate $\alpha$ serves as a balancing factor, controlling the model’s dependency on historical information, allowing it to retain past data to some extent while also adapting to new scenes in a timely manner.

On the other hand, construction site datasets often contain noise and uncertainties, such as image blurring or distortion caused by weather and lighting variations. In such cases, the model needs to maintain robustness, performing well even in complex environments. In the SCM, adjusting $\alpha$ helps the model better handle these uncertainties, thereby enhancing its robustness. By fine-tuning $\alpha$, the model achieves an optimal balance between historical data retention and adaptability to new information, ensuring consistent and accurate object detection in construction site environments.
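The trade-off controlled by the decay rate can be made concrete with a toy sequence. The sketch below assumes a single scalar score per frame and a one-frame dropout standing in for a momentary occlusion; the sequence and the two decay rates are hypothetical and were not taken from the experiments.

```python
def smooth(raw_scores, alpha):
    """Apply s(t) = (1 - alpha) * s(t-1) + alpha * s'(t) along a sequence."""
    smoothed = [raw_scores[0]]
    for score in raw_scores[1:]:
        smoothed.append((1.0 - alpha) * smoothed[-1] + alpha * score)
    return smoothed

raw = [1.0, 1.0, 1.0, 0.0, 1.0, 1.0]   # the fourth frame is occluded
slow = smooth(raw, alpha=0.2)           # leans on history
fast = smooth(raw, alpha=0.9)           # leans on the current frame
print(min(slow), min(fast))
print(slow[-1], fast[-1])
```

A small decay rate barely reacts to the occluded frame but also takes longer to return to the clean score, while a large decay rate follows the dropout almost completely and recovers quickly. Choosing the decay rate is therefore a balance between noise suppression and responsiveness.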

3.3 Augmented Intersection over Union

In this research, a novel augmented loss function called AIoU is used, which integrates the CIoU [44] and EIoU [45] loss functions as its foundation. In YOLOv8, the regression loss function uses CIoU [44], and the CIoU loss function is expressed as follows:

$$\mathcal{L}_{\text{CIoU}} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha v \tag{6}$$

In this context, $b$ and $b^{gt}$ represent the center points of the predicted box and the target box, respectively. $\rho(\cdot)$ denotes the Euclidean distance, $c$ is the diagonal length of the smallest enclosing box that covers both boxes, $\alpha$ is a positive trade-off parameter, and $v$ measures the consistency of the aspect ratio.

In Eq. (6), $v$ reflects the difference in aspect ratios rather than the actual differences in width and height with their respective confidences. Consequently, it sometimes hinders the model’s effective optimization of similarity. The penalty term in EIoU disaggregates the influence factors of the aspect ratio, calculating the length and width of the target box and the predicted box separately based on the penalty term in CIoU. The loss function includes three parts: overlap loss, center distance loss, and width-height loss. The first two parts follow the methods in CIoU, but the width-height loss directly minimizes the difference in width and height between the target box and the predicted box, resulting in faster convergence. The EIoU loss function is expressed as follows:

$$\mathcal{L}_{\text{EIoU}} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c_w^2 + c_h^2} + \frac{\rho^2\left(w, w^{gt}\right)}{c_w^2} + \frac{\rho^2\left(h, h^{gt}\right)}{c_h^2} \tag{7}$$

where $c_w$ and $c_h$ are the width and height of the smallest enclosing box, $w$ and $h$ are the width and height of the predicted box, and $w^{gt}$ and $h^{gt}$ are the width and height of the target box.

In the application of AIoU, the aspect ratio of the bounding box is first adjusted by CIoU until it converges to an appropriate range. Then, each edge is meticulously refined by EIoU until it converges to the correct value. The AIoU loss is computed using the following formula:

$$\mathcal{L}_{\text{AIoU}} = 1 - IoU + \alpha v + \frac{\rho^2\left(b, b^{gt}\right)}{c_w^2 + c_h^2} + \frac{\rho^2\left(w, w^{gt}\right)}{c_w^2} + \frac{\rho^2\left(h, h^{gt}\right)}{c_h^2} \tag{8}$$

CIoU emphasizes the completeness of the intersection and union between target detection boxes, while EIoU focuses on computational efficiency. Combining the two allows for a comprehensive consideration of both detection accuracy and computational efficiency, enabling the model to maintain high accuracy while utilizing computational resources more effectively. In construction site environments, precise object location information is crucial for monitoring and safety management, and computational efficiency is also an important factor. The combination of CIoU and EIoU balances these needs: CIoU enhances sensitivity in detecting small objects, and EIoU increases the model’s robustness, allowing it to better adapt to complex environmental changes.
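A minimal sketch of the AIoU loss in Eq. (8) for a single pair of axis-aligned boxes is given below. Several details are assumptions rather than statements from this paper: boxes are given as `(x1, y1, x2, y2)`, and $v$ and $\alpha$ follow the usual CIoU definitions, $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ and $\alpha = \frac{v}{(1 - IoU) + v}$. The function name and the small `eps` guard are hypothetical.

```python
import math

def aiou_loss(box, gt, eps=1e-9):
    """AIoU loss of Eq. (8) for boxes in (x1, y1, x2, y2) format."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Overlap term
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Width and height of the smallest enclosing box
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)

    # Squared distance between box centers
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0

    # Aspect-ratio term (assumed to follow the usual CIoU definition)
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return (1.0 - iou
            + alpha * v
            + rho2 / (cw ** 2 + ch ** 2 + eps)
            + (w - gw) ** 2 / (cw ** 2 + eps)
            + (h - gh) ** 2 / (ch ** 2 + eps))

print(aiou_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)))
print(aiou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0)))
```

Since $c^2 = c_w^2 + c_h^2$, Eq. (8) is the CIoU loss of Eq. (6) plus the two width and height terms of Eq. (7). The loss is close to zero for identical boxes and exceeds 1 for disjoint boxes because the center-distance term still contributes when the IoU is zero.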

3.4 Overall Architecture Adjustment

This study uses YOLOv8 as the base network for the experiments and makes several key adjustments to its structure. First, it inserts the GOM before the SPPF module. This design decision is based on GOM’s ability to enlarge the network’s receptive field, allowing it to better capture the context of the entire image, thereby improving object detection accuracy and robustness. Second, it places the SCM immediately after the SPPF. The role of this module is to further enhance object localization and capture based on the multi-scale features extracted by the SPPF, particularly for detecting small or low-contrast objects. Finally, it uses AIoU as the bounding box regression loss when training the whole network. AIoU, the augmented IoU loss defined in Section 3.3, penalizes errors in aspect ratio, width, and height together and helps the network learn the precise boundaries of objects during training. This design effectively improves the model’s performance on construction site datasets, making it more suitable for real-world construction scenarios by enhancing the accuracy and robustness of object detection.
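The placement described above can be sketched as a reordering of layer names. The list below assumes the commonly published YOLOv8 backbone order (alternating Conv and C2f stages ending in SPPF); the list, the string labels, and the helper name are illustrative and were not taken from the authors' code.

```python
def insert_gso_modules(backbone):
    """Place GOM directly before SPPF and SCM directly after it."""
    out = []
    for layer in backbone:
        if layer == "SPPF":
            out.extend(["GOM", "SPPF", "SCM"])
        else:
            out.append(layer)
    return out

# Assumed YOLOv8 backbone order, written as plain labels
yolov8_backbone = ["Conv", "Conv", "C2f", "Conv", "C2f",
                   "Conv", "C2f", "Conv", "C2f", "SPPF"]
gso_backbone = insert_gso_modules(yolov8_backbone)
print(gso_backbone[-3:])
```

Keeping both modules next to SPPF leaves the earlier stages and the detection head untouched, which is consistent with the claim that the modules can be reused in other detectors.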

4 Experiment

4.1 Datasets

In this section, this study briefly introduces the three large-scale construction site datasets it uses: SODA [46], MOCS [47], and CIS [48].

The SODA dataset is an image dataset in VOC format, containing 19,846 images and annotations for 286,201 objects. SODA includes 15 categories: slogan, fence, hook, hopper, electric box, cutter, handcart, scaffold, brick, rebar, wood, board, helmet, vest, and person, covering the most common objects found on construction sites.

The MOCS dataset is a large-scale image dataset for detecting moving objects in construction sites. All images in the MOCS dataset are collected from real construction sites. The dataset comprises 41,668 images with annotations for 13 categories of objects. The categories in MOCS are worker, tower crane, hanging hook, vehicle crane, roller, bulldozer, excavator, truck, loader, pump truck, concrete transport mixer, pile driver, and other vehicles.

The CIS dataset is a novel image dataset designed to advance state-of-the-art instance segmentation in the field of construction management. It contains 50,000 images with 10 object categories, including two categories of workers: workers wearing and not wearing safety helmets; one category of materials: precast components (PCs); and seven categories of machines: PC delivery trucks, dump trucks, concrete mixer trucks, excavators, rollers, dozers, and wheel loaders.

4.2 Implementation Details

In the experiments, this study uses an RTX-3060 GPU as the experimental platform and selects YOLOv8 as the primary backbone network. It conducts experiments on three widely used construction site datasets: SODA, MOCS, and CIS. To ensure comparability and fairness in the experiments, it applies the same hyperparameter configuration to each dataset. Specifically, it uses the configuration shown in Table 1.

Table 1: Detailed experimental configuration.
Setting               Value
Base learning rate    0.01
Base weight decay     0.0005
Optimizer momentum    0.937
Batch size            16
Training epochs       100
Warmup iterations     max(1000, 3 * iters_per_epochs)
Input size            640 × 640

With the above strict configuration and selection criteria, these experiments can obtain reliable and trustworthy results, enabling the comparison of performance across different models and providing a solid foundation and reference for further research.
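The warmup rule in Table 1 depends on the number of iterations per epoch, which in turn depends on the training set size. The sketch below evaluates it under two assumptions that the paper does not state: the final partial batch counts as an iteration, and the example image counts are hypothetical training set sizes (the paper does not give the train/test splits).

```python
import math

def warmup_iterations(num_train_images, batch_size=16):
    """Evaluate max(1000, 3 * iters_per_epoch) from Table 1."""
    iters_per_epoch = math.ceil(num_train_images / batch_size)
    return max(1000, 3 * iters_per_epoch)

# Hypothetical training set sizes, chosen only to show both branches of max()
print(warmup_iterations(19846))
print(warmup_iterations(4000))
```

For large training sets the warmup spans three epochs, while for small ones the floor of 1000 iterations applies.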

4.3 Comparison with the SOTA Results

1) Results on various image detection benchmarks: This study conducts experiments on the SODA, MOCS, and CIS datasets to evaluate the performance of GSO-YOLO. The experimental results are summarized in Table 2. GSO-YOLO improves on the YOLOv8 baseline on all three datasets and on both metrics.

On the SODA dataset, with YOLOv8 as the baseline, the mAP at an IoU threshold of 0.5 (mAP50) increases from 70.13% to 81.54%, and the mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP50-95) increases from 40.49% to 50.20%. These gains correspond to relative improvements of about 16% and 24%, respectively.

GSO-YOLO also outperforms the baseline on the other datasets. On the MOCS dataset, it achieves relative improvements of about 17% and 24% in mAP50 and mAP50-95, respectively. On the CIS dataset, where the baseline scores are already the highest of the three, the gains are smaller, with relative improvements of about 6.6% and 16.7% in mAP50 and mAP50-95, respectively.

These results indicate that GSO-YOLO consistently improves on the YOLOv8 baseline across all three construction site datasets, with the largest gains on SODA and MOCS.

Table 2: Comparison of YOLOv8 and GSO-YOLO on the construction site target detection datasets.
Dataset    Method      mAP50               mAP50-95
SODA       YOLOv8      70.13%              40.49%
SODA       GSO-YOLO    81.54% (↑11.41%)    50.20% (↑9.71%)
MOCS       YOLOv8      64.02%              46.24%
MOCS       GSO-YOLO    75.13% (↑11.11%)    57.47% (↑11.23%)
CIS        YOLOv8      82.57%              63.56%
CIS        GSO-YOLO    88.03% (↑5.46%)     74.20% (↑10.64%)
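The arrows in Table 2 are absolute differences in percentage points, whereas the percentages quoted in the text are relative to the baseline. The short script below recomputes both from the table values; the dictionary layout and function name are illustrative only.

```python
# (baseline, GSO-YOLO) pairs copied from Table 2
table2 = {
    "SODA": {"mAP50": (70.13, 81.54), "mAP50-95": (40.49, 50.20)},
    "MOCS": {"mAP50": (64.02, 75.13), "mAP50-95": (46.24, 57.47)},
    "CIS":  {"mAP50": (82.57, 88.03), "mAP50-95": (63.56, 74.20)},
}

def gains(baseline, improved):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative

for dataset, metrics in table2.items():
    for metric, (base, ours) in metrics.items():
        absolute, relative = gains(base, ours)
        print(f"{dataset} {metric}: +{absolute:.2f} points, +{relative:.1f}% relative")
```

For example, the SODA mAP50 gain of 11.41 points is a relative gain of roughly 16% because the baseline is 70.13%.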

2) Training-Accuracy Curve Analysis: To illustrate the impact of the proposed method on training performance, this study compares the baseline method and GSO-YOLO on the SODA dataset. It plots mAP50 against training epochs, and the results are shown in Figure 3. GSO-YOLO consistently outperforms the baseline method.

Figure 3: mAP-Epochs curves.

3) Classes-Accuracy Curve Analysis: To compare detection performance at the class level, this study plots the per-class mAP50 of both methods on each dataset. As illustrated in Figure 4, GSO-YOLO achieves a higher mAP50 than YOLOv8 for each class in every dataset.

Figure 4: mAP-classes curves. (a) mAP-classes curves on SODA. (b) mAP-classes curves on MOCS. (c) mAP-classes curves on CIS.

4.4 Ablation Study

1) Performance Comparison between YOLOv8 with GOM, SCM, and GSO-YOLO: To further elucidate the contributions of GOM and SCM to GSO-YOLO, this study conducts experiments on the three datasets using YOLOv8 as the baseline method. SCM alone, GOM alone, and both modules together are added to the baseline, and the resulting models are compared with the full GSO-YOLO. The experimental results are summarized in Table 3.

Table 3: Results of ablation experiments using different methods based on YOLOv8.
     Dataset      Method              mAP50               mAP50-95
     SODA         YOLOv8              70.13%              40.49%
     SODA         YOLOv8+SCM          72.06%(↑1.93%)      42.30%(↑1.81%)
     SODA         YOLOv8+GOM          78.31%(↑8.18%)      46.77%(↑6.28%)
     SODA         YOLOv8+SCM+GOM      78.78%(↑8.65%)      47.12%(↑6.63%)
     SODA         GSO-YOLO            81.54%(↑11.41%)     50.20%(↑9.71%)
     MOCS         YOLOv8              64.02%              46.24%
     MOCS         YOLOv8+SCM          64.60%(↑0.58%)      46.61%(↑0.37%)
     MOCS         YOLOv8+GOM          71.86%(↑7.84%)      53.53%(↑7.29%)
     MOCS         YOLOv8+SCM+GOM      72.06%(↑8.04%)      53.77%(↑7.53%)
     MOCS         GSO-YOLO            75.13%(↑11.11%)     57.47%(↑11.23%)
     CIS          YOLOv8              82.57%              63.56%
     CIS          YOLOv8+SCM          82.80%(↑0.23%)      63.57%(↑0.01%)
     CIS          YOLOv8+GOM          87.47%(↑4.90%)      70.96%(↑7.40%)
     CIS          YOLOv8+SCM+GOM      87.51%(↑4.94%)      71.03%(↑7.47%)
     CIS          GSO-YOLO            88.03%(↑5.46%)      74.20%(↑10.64%)

Taking the SODA dataset as an example, without GOM and SCM the mAP50 and mAP50-95 reach only 70.13% and 40.49%, respectively. When SCM is added alone, they increase to 72.06% and 42.30%. With the addition of GOM alone, they improve significantly, reaching 78.31% and 46.77%. When both modules are combined, the mAP further increases to 78.78% and 47.12%. Notably, when both modules work together and the original loss function is replaced with AIoU, the mAP further improves to 81.54% and 50.20%.

These findings indicate that GOM, SCM, and AIoU each contribute to detection accuracy, with GOM providing the largest single gain on all three datasets, and that GSO-YOLO achieves its best performance through their integration.
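
Because YOLOv8+SCM+GOM and GSO-YOLO differ only in the loss function, the difference between those two rows of Table 3 isolates the contribution of AIoU. The sketch below computes such step-wise gains from the table values; the data structure and the helper name `step_gain` are introduced here for illustration only.

```python
# (mAP50, mAP50-95) in %, copied from Table 3.
TABLE3 = {
    "SODA": {
        "YOLOv8": (70.13, 40.49),
        "YOLOv8+SCM": (72.06, 42.30),
        "YOLOv8+GOM": (78.31, 46.77),
        "YOLOv8+SCM+GOM": (78.78, 47.12),
        "GSO-YOLO": (81.54, 50.20),
    },
    "MOCS": {
        "YOLOv8": (64.02, 46.24),
        "YOLOv8+SCM": (64.60, 46.61),
        "YOLOv8+GOM": (71.86, 53.53),
        "YOLOv8+SCM+GOM": (72.06, 53.77),
        "GSO-YOLO": (75.13, 57.47),
    },
    "CIS": {
        "YOLOv8": (82.57, 63.56),
        "YOLOv8+SCM": (82.80, 63.57),
        "YOLOv8+GOM": (87.47, 70.96),
        "YOLOv8+SCM+GOM": (87.51, 71.03),
        "GSO-YOLO": (88.03, 74.20),
    },
}


def step_gain(dataset, before, after):
    """Gain in percentage points (mAP50, mAP50-95) from `before` to `after`."""
    b = TABLE3[dataset][before]
    a = TABLE3[dataset][after]
    return tuple(round(x - y, 2) for x, y in zip(a, b))


for dataset in TABLE3:
    # Only the loss function changes between these two configurations.
    print(dataset, "AIoU step:", step_gain(dataset, "YOLOv8+SCM+GOM", "GSO-YOLO"))
```

On SODA, for instance, the AIoU step adds 2.76 and 3.08 percentage points in mAP50 and mAP50-95, compared with 8.18 and 6.28 for GOM alone.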

2) Comparison of Features in Different Methods: This part examines how each component of GSO-YOLO changes the features the network attends to, relative to YOLOv8. To this end, this study conducts feature map analysis of the example image in Figure 5 under five configurations: YOLOv8, YOLOv8 with SCM, YOLOv8 with GOM, both modules applied simultaneously, and GSO-YOLO. In the heatmaps, deeper red indicates higher attention concentration, and a larger red area represents a broader receptive field of the model. The detailed results of these feature maps are shown in Figure 6. From left to right, the five feature maps represent the five methods: YOLOv8, YOLOv8+SCM, YOLOv8+GOM, YOLOv8+SCM+GOM, and GSO-YOLO.

Figure 5: Original image example.
Figure 6: Visualization of channel feature maps. (a) Feature Map of YOLOv8. (b) Feature Map of YOLOv8 with SCM. (c) Feature Map of YOLOv8 with GOM. (d) Feature Map of YOLOv8 with SCM and GOM. (e) Feature Map of GSO-YOLO.
Figure 7: Generalization Experimental Performance. (a) Ground truth labels. (b) Predicted labels using YOLOv8. (c) Predicted labels using YOLOv8+SCM. (d) Predicted labels using YOLOv8+GOM. (e) Predicted labels using YOLOv8+SCM+GOM. (f) Predicted labels using GSO-YOLO.

In the feature maps of Figure 6, (a) concentrates on a very limited area, indicating that YOLOv8 has limited effectiveness in object detection for construction scenes. In (b), with the addition of SCM, more object features are successfully captured. Similarly, with the involvement of GOM in (c), more global semantic features are captured. Notably, in (d), with both modules combined, the model captures more features of the objects to be identified and shows a significant enhancement after global optimization. Finally, with the application of AIoU in (e), the model improves both its sensitivity to small objects and its robustness, allowing it to better adapt to complex environmental changes.

Taken together, these results show that each component of GSO-YOLO addresses a specific shortcoming of the baseline, and that the improvements accumulate step by step.

4.5 Generalization Study

This section investigates the generalization performance of the proposed GSO-YOLO. The model is trained on the three datasets and tested on their respective validation sets. Figure 7 compares the predictions obtained after adding each module with the ground truth labels in (a). With GSO-YOLO, a significant expansion of the model’s receptive field and a notable improvement in test accuracy can be observed, with the test precision for certain categories even reaching 1.0. These findings highlight the effectiveness of GSO-YOLO in enhancing generalization capability, thereby improving overall performance in construction scene object detection tasks.

5 Conclusion

This study develops the GSO-YOLO model for construction site monitoring, which integrates GOM and SCM. GOM combines global attention weighting with local information to prevent information loss and enhance the capture of global interaction features, improving the model’s performance and generalization. SCM uses dynamic exponential moving averages to process historical detection results, reducing noise, smoothing outputs, and boosting stability and robustness. Together, GOM and SCM comprehensively enhance network performance.
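
The smoothing idea behind SCM can be illustrated with a plain exponential moving average. The sketch below is not the SCM implementation: the class name `EmaSmoother`, the warm-up schedule used to make the decay step-dependent, and the default `max_decay` are all assumptions introduced here for illustration.

```python
class EmaSmoother:
    """Exponential moving average with a step-dependent ("dynamic") decay.

    Assumption: the decay warms up as min(max_decay, (1 + t) / (10 + t)),
    a common EMA warm-up schedule; the schedule used by SCM may differ.
    """

    def __init__(self, max_decay=0.9):
        self.max_decay = max_decay
        self.step = 0
        self.value = None

    def update(self, x):
        """Fold one new observation into the running average and return it."""
        x = float(x)
        if self.value is None:
            # The first observation initializes the average.
            self.value = x
        else:
            decay = min(self.max_decay, (1 + self.step) / (10 + self.step))
            # Convex combination of the history and the new observation.
            self.value = decay * self.value + (1 - decay) * x
        self.step += 1
        return self.value


# Hypothetical per-frame confidence scores with two noisy dips.
scores = [0.80, 0.20, 0.82, 0.79, 0.25, 0.81]
smoother = EmaSmoother()
print([round(smoother.update(s), 3) for s in scores])
```

Early updates follow the input closely because the decay starts small. As the decay approaches `max_decay`, a single outlier moves the output less, which is the stabilizing effect described above.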

Additionally, the AIoU loss function, which merges CIoU and EIoU, improves detection in complex construction environments. CIoU accounts for the overlap area, center-point distance, and aspect-ratio consistency between the predicted and ground-truth boxes, while EIoU penalizes differences in width and height directly, which speeds up convergence. Thus, AIoU targets both accuracy and efficiency, allowing GSO-YOLO to maintain high accuracy and efficient resource use.
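
For reference, the two component losses can be sketched in plain Python using the standard CIoU and EIoU formulas. Boxes are assumed to be non-degenerate `(x1, y1, x2, y2)` tuples. The function `blended_loss` and its weight `lam` are hypothetical: they show one simple way of combining the two terms and are not the AIoU definition used by GSO-YOLO.

```python
import math


def _iou_terms(box, gt):
    """Return IoU, squared center distance, and enclosing-box width/height."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = (x2 - x1) * (y2 - y1) + (gx2 - gx1) * (gy2 - gy1) - inter
    iou = inter / union
    cw = max(x2, gx2) - min(x1, gx1)  # width of the smallest enclosing box
    ch = max(y2, gy2) - min(y1, gy1)  # height of the smallest enclosing box
    rho2 = ((x1 + x2 - gx1 - gx2) ** 2 + (y1 + y2 - gy1 - gy2) ** 2) / 4.0
    return iou, rho2, cw, ch


def ciou_loss(box, gt, eps=1e-9):
    """CIoU: IoU term + center-distance term + aspect-ratio term."""
    iou, rho2, cw, ch = _iou_terms(box, gt)
    w, h = box[2] - box[0], box[3] - box[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / (cw ** 2 + ch ** 2 + eps) + alpha * v


def eiou_loss(box, gt, eps=1e-9):
    """EIoU: IoU term + center-distance term + width and height terms."""
    iou, rho2, cw, ch = _iou_terms(box, gt)
    w, h = box[2] - box[0], box[3] - box[1]
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    return (1 - iou + rho2 / (cw ** 2 + ch ** 2 + eps)
            + (w - gw) ** 2 / (cw ** 2 + eps)
            + (h - gh) ** 2 / (ch ** 2 + eps))


def blended_loss(box, gt, lam=0.5):
    """Hypothetical convex blend of CIoU and EIoU; NOT the paper's AIoU."""
    return lam * ciou_loss(box, gt) + (1 - lam) * eiou_loss(box, gt)


pred, target = (0.0, 0.0, 4.0, 2.0), (0.0, 0.0, 2.0, 2.0)
print(ciou_loss(pred, target), eiou_loss(pred, target), blended_loss(pred, target))
```

In this example the predicted box is twice as wide as the target. EIoU penalizes the width mismatch directly, so its value is noticeably larger than the CIoU value, whose aspect-ratio term is comparatively small.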

These improvements are validated on the SODA, MOCS, and CIS construction site datasets, achieving state-of-the-art (SOTA) performance and showing that GSO-YOLO significantly enhances target detection performance. In summary, GSO-YOLO effectively addresses complex construction site challenges, improving detection accuracy and stability.

References

  • [1] C. Melchior and R. R. Zanini, “Mortality per work accident: A literature mapping,” Safety Science, vol. 114, pp. 72–78, 2019.
  • [2] “The mohurd’s notification on the safety accidents in housing and municipal works in 2019,” Standardization of Engineering Construction, pp. 51–53, 2020.
  • [3] B. H. W. Guo, Y. Zou, Y. Fang, Y. Goh, and P. X. W. Zou, “Computer vision technologies for safety science and management in construction: A critical review and future research directions,” Safety Science, vol. 135, no. 1, 2021.
  • [4] X. Wang, H. Wang, C. Zhang, Q. He, and L. Huo, “A sample balance-based regression module for object detection in construction sites,” Applied Sciences, vol. 12, no. 13, p. 6752, 2022.
  • [5] F. Hao, T. Zhang, G. He, R. Dou, and C. Meng, “Casnli-yolo: construction site multi-target detection method based on improved yolov5s,” IOP Publishing Ltd, 2024.
  • [6] Y. Su and L. Liu, “Real-time tracking and analysis of construction operations,” in Proc., Construction Research Congress.   ASCE Grand Bahama Island, Bahamas, 2007.
  • [7] W. Fang, P. E. D. Love, H. Luo, and L. Ding, “Computer vision for behaviour-based safety in construction: A review and future directions,” Advanced Engineering Informatics, vol. 43, no. Jan., pp. 100 980.1–100 980.13, 2020.
  • [8] M. Park, D. Q. Tran, J. Bak, and S. Park, “Advanced wildfire detection using generative adversarial network-based augmented datasets and weakly supervised object localization,” International Journal of Applied Earth Observation and Geoinformation, vol. 114.
  • [9] D. Q. Tran, M. Park, D. Jung, and S. Park, “Damage-map estimation using uav images and deep learning algorithms for disaster management system,” Remote Sensing, vol. 12, no. 24, p. 4169, 2020.
  • [10] W. Fang, L. Ding, H. Luo, and P. E. D. Love, “Falls from heights: A computer vision-based approach for safety harness detection,” Automation in Construction, vol. 91, no. JUL., pp. 53–61, 2018.
  • [11] Y. Wang, X. Liu, Q. Zhao, H. He, and Z. Yao, “Target detection for construction machinery based on deep learning and multi-source data fusion,” IEEE Sensors Journal, 2023.
  • [12] I. Brilakis, H. Fathi, and A. Rashidi, “Progressive 3d reconstruction of infrastructure with videogrammetry,” Automation in Construction, vol. 20, no. 7, pp. 884–895, 2011.
  • [13] M.-W. Park, N. Elsafty, and Z. Zhu, “Hardhat-wearing detection for enhancing on-site safety of construction workers,” Journal of Construction Engineering and Management, vol. 141, no. 9, p. 04015024, 2015.
  • [14] Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T. M. Rose, and W. An, “Detecting non-hardhat-use by a deep learning method from far-field surveillance videos,” Automation in Construction, vol. 85, no. JAN., pp. 1–9, 2018.
  • [15] S. Roh, Z. Aziz, and F. Pena-Mora, “An object-based 3d walk-through model for interior construction progress monitoring,” Automation in Construction, vol. 20, no. 1, pp. 66–75, 2011.
  • [16] M. Golparvar-Fard, F. Peña-Mora, and S. Savarese, “Application of d4ar-a 4-dimensional augmented reality model for automating construction progress monitoring data collection,” Signal Processing Image Communication, vol. 14, pp. 129–153, 2009.
  • [17] Z. Zhu and I. Brilakis, “Concrete column recognition in images and videos,” Journal of Computing in Civil Engineering, vol. 24, no. 6, pp. 478–487, 2010.
  • [18] Y. J. Cha, W. Choi, and O. Büyüköztürk, “Deep learning-based crack damage detection using convolutional neural networks,” Computer-Aided Civil and Infrastructure Engineering, 2017.
  • [19] Tiantang, Chen, Yingying, Zhu, and Aixi, “Efficient crack detection method for tunnel lining surface cracks based on infrared images,” Journal of Computing in Civil Engineering, 2017.
  • [20] Chen, Zhi, Ren, Xiaoning, Zhu, and Zhenhua, “Visual tracking of construction jobsite workforce and equipment with particle filtering,” Journal of Computing in Civil Engineering, 2016.
  • [21] J. Yang, R. Arif, R. Vela, R. Teizer, and R. Shi, “Tracking multiple workers on construction sites using video cameras,” Advanced Engineering Informatics, vol. 24, no. 4, pp. 428–434, 2010.
  • [22] Enrique, Valero, Aparajithan, Sivanathan, Frédéric, Bosché, Mohamed, and Abdel-Wahab, “Musculoskeletal disorders in construction: A review and a novel system for activity tracking with body area network,” Applied Ergonomics, 2016.
  • [23] J. O. Seo, R. Starbuck, S. U. Han, S. H. Lee, and T. J. Armstrong, “Motion data-driven biomechanical analysis during construction tasks on sites,” Journal of Computing in Civil Engineering, vol. 29, no. 4, pp. B4 014 005.1–B4 014 005.13, 2015.
  • [24] M.-W. Park, N. Elsafty, and Z. Zhu, “Hardhat-wearing detection for enhancing on-site safety of construction workers,” Journal of Construction Engineering and Management, vol. 141, no. 9, p. 04015024, 2015.
  • [25] Q. Fang, H. Li, X. Luo, L. Ding, H. Luo, T. M. Rose, and W. An, “Detecting non-hardhat-use by a deep learning method from far-field surveillance videos,” Automation in Construction, vol. 85, no. JAN., pp. 1–9, 2018.
  • [26] B. E. Mneymneh, M. Abbas, and H. Khoury, “Evaluation of computer vision techniques for automated hardhat detection in indoor construction safety applications,” Frontiers of Engineering Management, vol. 5, no. 002, pp. 227–239, 2018.
  • [27] M. Mukhiddinov, A. B. Abdusalomov, and J. Cho, “Automatic fire detection and notification system based on improved yolov4 for the blind and visually impaired,” Sensors, vol. 22, no. 9, p. 3307, 2022.
  • [28] D. Kim, M. Liu, S. Lee, and V. R. Kamat, “Remote proximity monitoring between mobile construction resources using camera-mounted uavs,” Automation in Construction, vol. 99, pp. 168–182, 2019.
  • [29] D. Roberts and M. Golparvar-Fard, “End-to-end vision-based detection, tracking and activity analysis of earthmoving equipment filmed at ground level,” Automation in Construction, vol. 105, pp. 102 811–, 2019.
  • [30] M. Mukhiddinov and J. Cho, “Smart glass system using deep learning for the blind and visually impaired,” Electronics, 2021.
  • [31] W. Fang, L. Ding, B. Zhong, P. E. D. Love, and H. Luo, “Automated detection of workers and heavy equipment on construction sites: A convolutional neural network approach,” Advanced Engineering Informatics, 2018.
  • [32] M. Park, J. Bak, S. Park et al., “Small and overlapping worker detection at construction sites,” Automation in Construction, vol. 151, p. 104856, 2023.
  • [33] J. Su, Z. Chen, C. He, D. Guan, C. Cai, T. Zhou, J. Wei, W. Tian, and Z. Xie, “Gsenet: Global semantic enhancement network for lane detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 13, 2024, pp. 15 108–15 116.
  • [34] Y. J. Cha, W. Choi, and O. Büyüköztürk, “Deep learning-based crack damage detection using convolutional neural networks,” Computer-Aided Civil and Infrastructure Engineering, 2017.
  • [35] C. Feng, M. Y. Liu, C. C. Kao, and T. Y. Lee, “Deep active learning for civil infrastructure defect detection and classification,” in ASCE International Workshop on Computing in Civil Engineering 2017. American Society of Civil Engineers, 2017.
  • [36] S. Yan, Y. Zhang, H. Sun, and A. Wang, “A real-time operational carbon emission prediction method for the early design stage of residential units based on a convolutional neural network: A case study in beijing, china,” Journal of Building Engineering, vol. 75, p. 106994, 2023.
  • [37] Y. Zhang, D. Guan, and J. Liu, “Angle acquisition of scaffold rod based on laser scanning point cloud,” Construction Technology, vol. 53, no. 2, pp. 103–109, 2024.
  • [38] H. Gui**, W. Ruixuan, X. WuYan, and L. Jun, “Night construction site detection based on ghost-yolox,” Connection Science, vol. 36, no. 1, p. 2316015, 2024.
  • [39] Y. Yan, “Towards efficient detection for small objects via attention-guided detection network and data augmentation,” Sensors, vol. 22, 2022.
  • [40] W. Zhou, X. Min, R. Hu, Y. Long, H. Luo et al., “Fasterx: Real-time object detection based on edge gpus for uav applications,” arXiv preprint arXiv:2209.03157, 2022.
  • [41] L. Chen, H. Zhang, J. Xiao, L. Nie, and T. S. Chua, “Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [42] Y. Liu, Z. Shao, and N. Hoffmann, “Global attention mechanism: Retain information to enhance channel-spatial interactions,” 2021.
  • [43] X. Li, Z. Zhong, J. Wu, Y. Yang, Z. Lin, and H. Liu, “Expectation-maximization attention networks for semantic segmentation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9167–9176.
  • [44] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, “Distance-iou loss: Faster and better learning for bounding box regression,” arXiv, 2019.
  • [45] Z. Y.-F., Z. Zhang, Z. Jia, L. Wang, T. Tan, and W. Ren, “Focal and efficient iou loss for accurate bounding box regression,” Neurocomputing, 2022.
  • [46] R. Duan, H. Deng, M. Tian, Y. Deng, and J. Lin, “Soda: Site object detection dataset for deep learning in construction,” 2022.
  • [47] A. Xuehui, Z. Li, L. Zuguang, W. Chengzhi, L. Pengfei, and L. Zhiwei, “Dataset and benchmark for detecting moving objects in construction sites,” Automation in Construction, vol. 122, p. 103482, 2021.
  • [48] X. Yan, H. Zhang, Y. Wu, C. Lin, and S. Liu, “Construction instance segmentation (cis) dataset for deep learning-based computer vision,” Automation in Construction, vol. 156, p. 105083, 2023.