Supervised Semantic Similarity-based Conflict Detection Algorithm: S3CDA

Garima Malik [1], Mucahit Cevik, Devang Parikh, Ayse Basar

[1] Toronto Metropolitan University, Toronto, M5K 2B3, Ontario, Canada

[2] IBM, North Carolina, USA

[email protected]

Abstract

In the realm of software development, the clarity, completeness, and comprehensiveness of requirements significantly impact the success of software systems. The Software Requirement Specification (SRS) document, a cornerstone of the software development life cycle, delineates both functional and nonfunctional requirements, playing a pivotal role in ensuring the quality and timely delivery of software projects. However, the inherent natural language representation of these requirements poses challenges, leading to potential misinterpretations and conflicts. This study addresses the need for conflict identification within requirements by delving into their semantic compositions and contextual meanings. Our research introduces an automated supervised conflict detection method known as the Supervised Semantic Similarity-based Conflict Detection Algorithm (S3CDA). This algorithm comprises two phases: identifying conflict candidates through textual similarity and employing semantic analysis to filter these conflicts. The similarity-based conflict detection involves leveraging sentence embeddings and cosine similarity measures to identify pertinent candidate requirements. Additionally, we present an unsupervised conflict detection algorithm, UnSupCDA, combining key components of S3CDA, tailored for unlabeled software requirements. Generalizability of our methods is tested across five SRS documents from diverse domains. Our experimental results demonstrate the efficacy of the proposed conflict detection strategy, achieving high accuracy in automated conflict identification.

keywords:
Software Requirement Specifications, Conflict Detection, Sentence Similarity, Sentence Embeddings, Named Entity Recognition
articletype: Original Article

1 Introduction

Requirement Engineering (RE) is the process of defining, documenting, and maintaining software requirements [1]. The RE process involves four main activities: requirements elicitation, requirements specification, requirements verification and validation, and requirements management. The deliverable of the requirements specification activity is the Software Requirement Specification (SRS) document, which plays a central role in the Software Development Life Cycle (SDLC) [2]. SRS documents describe the functionality and expected performance of software products, and therefore affect all subsequent phases of the process. The requirement set defined in an SRS document is analyzed and refined in the design phase, which results in various design documents. Developers then use these documents to build the code for the software system [3].

SRS documents are mostly written in natural language to improve the comprehensibility of requirements. The success of any software system is largely dependent on the clarity, transparency, and comprehensibility of software requirements [4]. Conflicting and incomprehensible software requirements might lead to longer project completion times, inefficient software systems, and higher project budgets. Detecting conflicts early in development is therefore very important; however, manual identification of these conflicts can be tedious and time-consuming, which motivates semi-automated or automated approaches for conflict detection in SRS documents. Considering the structure of software requirements, Natural Language Processing (NLP) methods can help in analyzing and understanding the software requirements semantically. Information extraction techniques such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging can be used for this purpose, alongside the semantic similarity of the natural language text, to interpret the context and syntactic nature of the software requirements.

To provide an automated approach for generalized conflict identification, we propose a supervised two-phase framework, the Supervised Semantic Similarity-based Conflict Detection Algorithm (S3CDA), which elicits the conflict criteria from the provided software requirements and outputs the conflicting requirements. In the first phase, we convert the software requirements into high-dimensional vectors using various sentence embeddings, and then identify the conflict candidates using cosine similarity. In the second phase, the candidate conflict set is refined by measuring the overlapping entities in the requirement texts, with a high level of overlap pointing to a conflict. Furthermore, we formulate an unsupervised variant of our proposed algorithm, called UnSupCDA. This algorithm integrates the core elements of the S3CDA approach and is designed to handle unlabeled requirements and identify the conflicts among them.

Research Contribution

The main contributions of our study can be summarized as follows:

  • We introduce two novel conflict identification techniques, namely S3CDA and UnSupCDA, meticulously designed through a comprehensive analysis of software requirement structures. We assess the efficacy of these proposed methodologies through extensive numerical experiments conducted on five diverse SRS documents, providing valuable insights into their practical applicability and performance.

  • We make use of information extraction techniques in detecting requirement conflicts. Specifically, we apply a software-specific NER model to extract the key entities from the software requirements, which can indicate the presence of conflicts. Additionally, we conduct a thorough analysis of the correlation between semantic similarity among requirements and the overlap of entities in requirement pairs across diverse datasets.

Structure of the Paper

The remainder of the paper is organized as follows. Section 2 provides the background on the problem of conflict identification in software requirement datasets. Section 3 introduces our proposed method for automated conflict detection, and provides a detailed discussion over dataset characteristics, sentence embeddings, and NER. In Section 4, we present the results from our experiments and discuss the applicability of our proposed approaches. Lastly, Section 6 provides concluding remarks and future research directions.

2 Background

Previous studies suggest the use of NLP-based techniques to solve various software requirement related problems such as requirement classification [5, 6], ambiguity detection [7], bug report classification [8], duplicate bug report prediction [9], conflict identification [10], and mapping of natural language-based requirements into formal structures [11]. Conflict detection is one of the most difficult problems in requirement engineering [3]. Failure to identify conflicts in software requirements might lead to uncertainties and cost overruns in software development. Several papers have discussed conflict identification in various domains; however, an autonomous, reliable, and generalizable approach for detecting conflicting requirements is yet to be achieved. Below, we first introduce the basic definitions and then review the conflict detection strategies for functional and non-functional requirements.

The terms ‘Ambiguity’ and ‘Conflict’ can be misconstrued in the requirement engineering context. Researchers have formally defined requirement ambiguity as a requirement having more than one meaning, and have provided various techniques to detect requirement ambiguities in SRS documents [1, 12]. On the other hand, requirement conflict detection remains a challenging problem, lacking a well-accepted formal definition and structure. Several studies define requirement conflict depending upon the domain of the requirements. However, the term ‘conflict’ can be defined more broadly as the presence of interference, interdependency, or inconsistency between requirements [13]. Kim et al. [14] defined a requirement conflict as interactions and dependencies between requirements that result in negative or undesired operation of the software system.

Butt et al. [15] defined requirement conflicts based on the categorization of requirements into mandatory, essential, and optional requirements. Kim et al. [14] described the requirement structure as Actor (Noun) + Action (verb) + Object (object) + Resource (resource). An activity conflict can arise when two requirements achieve the same action through different objects, and a resource conflict may arise when different components try to share the same resources. Moser et al. [16] categorized conflicts as simple (if the conflict exists between two requirements) and complex (if it exists between three or more requirements). Recently, Guo et al. [10] proposed a comprehensive definition for semantic conflicts amongst different functional requirements. They stated that if two requirements have an inferential, interdependent, or inclusive relationship, then this may lead to inconsistent behaviour in the software system.

In our work, the conflicts are defined based on the premise that if the implementations of $r_i$ and $r_j$ cannot coexist or if the implementation of the first adversely impacts the second, then they are considered conflicts in our dataset. An example of such a conflict can be observed in the following requirements:

  1. The UAV shall charge to 50 % in less than 3 hours.

  2. The UAV shall fully charge in less than 3 hours.

It’s not feasible to implement these requirements simultaneously as it will lead to inconsistency in the system. Notably, for the purpose of this study, requirements deemed as duplicates or paraphrased versions of each other are also considered conflicts due to their inherent redundancy.

Functional requirements specify the functionalities or features a system must possess and describe how the system should behave or what it should do. They outline the specific actions the system must be able to perform and typically address the system’s core operations. Table 1 lists the studies for conflict identification in functional requirements. The table provides insights into the domain of SRS documents, the datasets employed, and the types of conflicts addressed in each study. The majority of these studies utilize rule-based and heuristic methods, facing limitations associated with the scarcity of extensive datasets and the absence of a standardized methodology. Often reliant on case-study approaches, these investigations typically validate their methods using a limited set of requirements, posing challenges for direct comparisons with our study.

Table 1: Conflict detection literature for functional requirements in SRS documents.
| Study | Conflict Detection Method | Domain | Dataset | Type of Conflicts |
|---|---|---|---|---|
| [17] | Rule-based and Genetic Algorithms (GA) | Software | 3 SRS documents | Functional conflicts |
| [10] | Rule-based and semantics | Varied | 3 SRS (validation), 2 SRS (testing) | Functional conflicts |
| [18] | Fuzzy branching temporal logic | Internet of Things (IoT) | System of Systems (SoS) Req. | Conflicts in resource-based req. |
| [19] | WebSpec tool and semantics | Web software | Case study | Structural conflicts |
| [20] | Ontology and semantics | Software | Case study | Simple and complex conflicts |
| [15] | Automated | Software | 25 Req. | Conflicts in mandatory, essential, and optional req. |
| [21] | Rule-based and tracing dependencies | Software | Voter registration system | Conflicts in Essential Use Cases (EUC) |
| [22] | Linear temporal logic | Bellcore industry | Telecom dataset | Functional conflicts |
| [23] | Heuristics and req. interactions | Mechanical | 12 Req. Lift System | Interactions in requirements |

Guo et al. [10] introduced a methodical approach, FSARC (a Finer Semantic Analysis-based Requirements Conflict Detector), aiming for a comprehensive semantic analysis of software requirements to identify conflicts. FSARC follows a seven-step procedure leveraging Stanford’s CoreNLP library [24]. The initial steps involve Part-of-Speech (POS) tagging and Stanford’s Dependency Parser (SDP) to transform each requirement into an eight-tuple representation $(id, group\_id, event, agent, operation, input, output, restriction)$. Subsequent rule-based routines are applied to identify conflicts based on this tuple. While the algorithm exhibited promising results and potential for generalization, it heavily depends on the CoreNLP library’s accurate generation of the eight-tuple, suggesting a reliance on a specific requirement structure for effective analysis.

Non-functional requirements define the criteria that characterize the operation of a system without detailing specific behaviors. These requirements focus on aspects such as performance, reliability, usability, scalability, and other qualities that are essential for the overall effectiveness and efficiency of the system but are not related to its specific functionalities. Table 2 presents the conflict identification studies for non-functional requirements.

Table 2: Conflict detection literature for non-functional requirements in SRS documents.
| Study | Conflict Detection Method | Domain | Dataset | Type of Conflicts |
|---|---|---|---|---|
| [25] | Rule-based and clustering | Software | 15 ATM Req. and 5 SRS documents | Intra-conflicts |
| [26] | SVM and BiLSTM model with Word2Vec embeddings | Software | 200 Req. | Software Product Quality |
| [27] | Rule-based and manual | Telecom | 14 Req. | Req. conflicts in OAM&P |
| [28] | Quantitative analysis | Chemical | Case study with 2 req. | |
| [13] | Rule-based | Software | NFRs | Security and usability |
| [29] | Rule-based with quality attributes | Software | Case study | Mutually exclusive and partial |
| [4] | Rule-based and manual | Software | 12 Req. Video on Demand | Req. Traceability |

Similar to the studies discussed in the previous section, the studies in Table 2 predominantly rely on rule-based methods and are validated on a limited number of requirements. As a result, there is little evidence that these methods generalize beyond the cases on which they were tested.

Lastly, we note that our work differs from these existing studies in multiple ways:

  • Das et al. [30] introduced the idea of sentence embeddings for similarity detection in software requirements. We extend this idea and combine the two sentence embeddings (SBERT and TFIDF) to calculate the cosine similarity between the requirements.

  • Our proposed automated approaches work directly with software requirements, as opposed to Guo et al. [10], who convert the software requirements into formal representations and apply rule-based procedures to detect the conflicts.

  • Different from the finer semantic analysis enabled by the rule-based approach of Guo et al. [10], we define a set of software-specific entities and train NLP-based transformer models for software requirements. These entities provide an additional way of verifying the conflicts semantically.

  • To provide the generic technique for conflict identification, we devise an unsupervised approach capable of handling raw requirements from varied domains and effectively capturing conflicts.

3 Methodology

In this section, we first describe the SRS datasets used in our numerical study. Then, we provide specific details of the building blocks of our conflict detection algorithms and the experimental setup.

3.1 Datasets

We consider five SRS datasets that belong to various domains such as software, healthcare, transportation, and hardware. Three of these are open-source SRS datasets (OpenCoss, WorldVista, and UAV), and the other two are extracted from public SRS documents. Generally, requirements are documented in a structured format, and we retain the original structure of the requirements for the conflict detection process. To maintain consistency in requirement structure, we converted complex requirements (e.g., paragraphs or compound sentences) into simple sentences. Table 3 provides summary information on the SRS datasets.

Table 3: Dataset characteristics
| Dataset | Domain | # Non-conflicts | # Known Conflicts | # Synthetic Conflicts |
|---|---|---|---|---|
| OpenCoss | Transportation | 97 | 4 | 16 |
| WorldVista | Medical | 78 | 10 | 60 |
| UAV | Aerospace | 80 | 10 | 26 |
| PURE | Thermodynamics | 27 | 0 | 40 |
| IBM-UAV | Hardware | 75 | 0 | 28 |

We briefly describe these SRS datasets below.

  • OpenCoss: OPENCOSS (http://www.opencoss-project.eu) refers to the Open Platform for Evolutionary Certification Of Safety-critical Systems for the railway, avionics, and automotive markets. Identifying conflicts in this dataset is challenging because it contains many similar or duplicate requirements with repeated words. Initially, this set included 110 requirements and we added 5 more synthetic conflicts.

  • WorldVista: WorldVista (http://coest.org/datasets) is a health management system that records patient information from hospital admission to discharge procedures. The requirement structure is basic, and the requirements are written in natural language with health care terminology. It originally consisted of 117 requirements and we added 23 synthetic conflicts.

  • UAV: The UAV (Unmanned Aerial Vehicle) [10, 31] dataset is created by the University of Notre Dame and it includes all the functional requirements which define the functions of the UAV control system. The requirement syntax is based on the template of EARS (Easy Approach to Requirements Syntax) [32]. Originally, this dataset had 99 requirements and we added 16 conflicting requirements to the set, which resulted in a conflict proportion of 30%.

  • PURE: PURE (Public Requirements dataset) contains 79 publicly available SRS documents collected from the web [33]. We manually extracted a set of requirements from two SRS documents, namely, THEMAS (Thermodynamic System) and Mashbot (a web interface for managing a company’s presence on social networks). In total, we collected 83 requirements and introduced 21 synthetic conflicts to maintain consistency with the other datasets.

  • IBM-UAV: This dataset is proprietary, and provided by IBM. It consists of software requirements used in various projects related to the aerospace and automobile industry. We sampled 75 requirements from the original set, and introduced 13 synthetic conflicts. The requirement text follows a certain format specified by IBM’s RQA (Requirement Quality Analysis) system.

The synthetic conflicts were introduced to each of these datasets by following the standard definitions provided in the literature [3, 34]. Table 4 shows sample synthetic conflicts, listing the requirement id, the requirement text, a ‘Conflict’ column indicating the presence of a conflict in ‘Yes’ or ‘No’ format, and a ‘Conflict-Label’ column that indicates the conflicting pair. For example, requirements 2 and 3 conflict with each other because they use different modal verbs (‘must’ versus ‘shall’). Similarly, requirements 10 and 11 show a mathematical operator conflict (‘<=’ versus ‘<’).

Table 4: Sample conflict and non-conflict instances from each dataset with requirement id, text, and conflict label (Yes/No).
| Dataset | Req. Id | Requirement text | Conflict | Conflict-Label |
|---|---|---|---|---|
| OpenCoss | 1. | The OPENCOSS platform shall store the characteristics of the evidence items of an assurance project according to the CCL. | No | No |
| OpenCoss | 2. | The OPENCOSS platform must provide users with the ability to specify evidence traceability links in traceability matrices. | Yes | Yes (3) |
| OpenCoss | 3. | The OPENCOSS platform shall provide users with the ability to specify evidence traceability links in traceability matrices. | Yes | Yes (2) |
| WorldVista | 4. | The system’s pilot program shall use a smart card to digitally sign medication orders. | Yes | Yes (5) |
| WorldVista | 5. | The system’s pilot program shall require a handwritten signature for medication orders. | Yes | Yes (4) |
| WorldVista | 6. | The system shall sort notifications based on column heading: Patient name (alphabetical or reverse alphabetical). | No | No |
| IBM-UAV | 7. | The UAV shall charge to 50 % in less than 3 hours. | Yes | Yes (8) |
| IBM-UAV | 8. | The UAV shall fully charge in less than 3 hours. | Yes | Yes (7) |
| IBM-UAV | 9. | Remote surveillance shall include video streaming for manual navigation of the surveillance platform. | No | No |
| PURE | 10. | If LO <= T <= UO, then this process shall output the temperature status. | Yes | Yes (11) |
| PURE | 11. | If LO <= T < UO, then this process shall output the temperature status. | Yes | Yes (12) |
| PURE | 12. | The THEMAS system shall maintain the ON/OFF status of each heating and cooling unit. | No | No |
| UAV | 13. | The _InternalSimulator_ shall approximate the behavior of a UAV. | No | No |
| UAV | 14. | The _VehicleCore_ shall support virtual UAVs. | Yes | Yes (15) |
| UAV | 15. | The _VehicleCore_ shall support up to three virtual UAVs. | Yes | Yes (14) |

3.2 S3CDA: Supervised Semantic Similarity-based Conflict Detection Algorithm

This section is structured into two parts as depicted in Figure 1. First, we explain the similarity-based conflict detection. Second, we define the semantic-based conflict identification to validate the potential conflicts obtained in Phase I.

Figure 1: Supervised Semantic Similarity-based Conflict Detection Algorithm (S3CDA). Proposed framework for identifying the conflicting requirements in SRS documents.

3.2.1 Phase I: Similarity-based Conflict Detection

Algorithm 1 formalizes the similarity-based conflict detection procedure (Phase I) shown in the left panel of Figure 1. The resulting set of conflicts is used as the candidate conflict set for Phase II of our framework. We first create the sentence embedding vector for each requirement $r \in \mathcal{R}$ using the SentenceEmbedding($\mathcal{R}$) procedure, which converts the requirements into numerical vectors using sentence embeddings. Below, we describe the sentence embedding models employed in the proposed algorithms.

  • TFIDF: Term Frequency Inverse Document Frequency is employed for generating vectors from requirements [35]. Each requirement is treated as a document, and TFIDF scores are calculated for each term in the document. The TFIDF value for a term $i$ in a document $d$ is given by $\text{TF}_{i,d} \times \text{IDF}_{i}$, where $\text{TF}_{i,d}$ is the term frequency of $i$ in document $d$, and $\text{IDF}_{i}$ is the inverse document frequency.

  • USE: Universal Sentence Encoder (USE) translates natural language text into high-dimensional vectors [36]. We utilize the Deep Averaging Network (DAN) version of USE, a pre-trained model optimized for encoding sentences, phrases, and short paragraphs. It produces 512-dimensional vectors for each input requirement.

  • SBERT–TFIDF: This method combines Sentence-BERT [37] (SBERT) and TFIDF embeddings. SBERT provides context and semantics in the vectors, while TFIDF prioritizes less frequent words. The process involves concatenating the vectors and employing Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction to ensure uniform vector size [38].
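To make the TFIDF bullet above concrete, the following is a minimal sketch of turning requirements into TFIDF vectors. It is not the implementation used in the study: the whitespace tokenizer, the length-normalized term frequency, the unsmoothed $\log(N/\text{df})$ weighting, and the function name `tfidf_vectors` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one TF-IDF vector per document.

    Assumptions (not from the paper): lowercase whitespace tokenization,
    TF = count / document length, IDF = log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = max(len(tokens), 1)  # guard against empty documents
        vectors.append([counts[t] / length * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Each requirement is treated as one document.
requirements = [
    "The UAV shall charge to 50 % in less than 3 hours.",
    "The UAV shall fully charge in less than 3 hours.",
    "The system shall sort notifications based on column heading.",
]
vocab, vecs = tfidf_vectors(requirements)
```

With this weighting, terms that occur in every requirement (such as "the" and "shall") receive a weight of zero, while rarer terms such as "fully" dominate the vector, which matches the remark that TFIDF prioritizes less frequent words.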

We next calculate the pairwise distance matrix ($\Delta$), which holds the cosine similarity value between each pair of requirements. Below, we provide a brief example of using cosine similarity for requirements.

Cosine Similarity:

Cosine similarity is an effective measure for estimating the similarity of vectors in high-dimensional space [39]. This metric models language-based input text as a vector of real-valued terms, and the similarity between two texts is derived from the cosine of the angle between their term vectors as follows:

$$\texttt{cos}(\mathbf{r_1}, \mathbf{r_2}) = \frac{\mathbf{r_1} \cdot \mathbf{r_2}}{\|\mathbf{r_1}\| \, \|\mathbf{r_2}\|} = \frac{\sum_{i=1}^{n} {\mathbf{r_1}}_i \, {\mathbf{r_2}}_i}{\sqrt{\sum_{i=1}^{n} ({\mathbf{r_1}}_i)^2} \, \sqrt{\sum_{i=1}^{n} ({\mathbf{r_2}}_i)^2}} \tag{1}$$
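The cosine similarity formula and the pairwise matrix $\Delta$ can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `cosine_similarity` and `pairwise_similarity` are hypothetical, and returning 0.0 for a zero-length vector is a convention assumed here, not something stated in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)


def pairwise_similarity(vectors):
    """Square matrix whose (i, j) entry is the cosine similarity of vectors i and j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy two-dimensional "embeddings" for three requirements.
delta = pairwise_similarity([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
```

In this toy example the first two vectors point in the same direction, so their similarity is 1 regardless of length, while the third is orthogonal to both and has similarity 0 with each.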

Cosine similarity values range from -1 to 1, where -1 indicates that two vectors point in opposite directions (maximal dissimilarity) and 1 indicates that they point in the same direction (maximal similarity). To better demonstrate how cosine similarity can be used over embedding vectors, we provide an illustrative example with three sample software requirements, $r_1$, $r_2$, and $r_3$, which are defined as follows:

  • $r_1$ = ‘The OPENCOSS platform shall be able to export evidence traceability links of an assurance project to external tools.’

  • $r_2$ = ‘The OPENCOSS platform must be able to send out evidence traceability links of an assurance project to external tools and internal tools.’

  • $r_3$ = ‘The OPENCOSS platform shall provide users with the ability to specify evidence traceability links in traceability matrices.’

We calculate the cosine similarity between these requirement vectors when embedded with TFIDF, SBERT, SBERT-TFIDF, and USE; the values are reported in Table 5. SBERT captures the semantic similarity between $r_1$ and $r_2$ with a cosine similarity value of 0.96, followed by USE with a value of 0.81. The text of $r_3$ is less similar to the other two requirements, and every sentence embedding yields a lower cosine similarity for pairs involving $r_3$ than for the pair ($r_1$, $r_2$).

Table 5: Cosine similarity between $r_1$, $r_2$, and $r_3$ with different sentence embeddings.
| | TFIDF | USE | SBERT | SBERT-TFIDF |
|---|---|---|---|---|
| cos($r_1$, $r_2$) | 0.44 | 0.81 | 0.96 | 0.72 |
| cos($r_1$, $r_3$) | 0.16 | 0.55 | 0.84 | 0.50 |
| cos($r_2$, $r_3$) | 0.09 | 0.40 | 0.83 | 0.46 |

Then, we use the ROC (receiver operating characteristic) curve to identify the cosine similarity threshold ($\delta$), which specifies the minimum similarity value above which requirements are labeled as conflicting. The cutoff value ($\delta$) is selected as the value that maximizes $\left\{\text{TPR}(\Delta,k)-(1-\text{FPR}(\Delta,k))\right\}$ over threshold values $k \in \left\{0.01, \ldots, 1.00\right\}$ and the distance matrix $\Delta$. This way, we balance the false positive and true positive rates, with the conflicts having the positive labels. Lastly, we assign labels of conflict or no-conflict to the requirements using $\delta$ as the threshold value. The candidate conflict set ($\mathcal{C}$) contains all the requirements with the conflict label. Note that the conflict property is symmetric, i.e., if $r_1$ conflicts with $r_2$, then $r_2$ also conflicts with $r_1$, and $r_1, r_2 \in \mathcal{C}$.

Algorithm 1 Similarity-based Conflict Detection
Input: Requirement set $\mathcal{R} = \{r_1, r_2, \ldots, r_n\}$
Output: Conflict set $\mathcal{C}$
$\vec{\mathcal{R}} \leftarrow$ SentenceEmbedding($\mathcal{R}$) // Generate requirement vectors
$\Delta \leftarrow$ PairwiseDistance($\vec{\mathcal{R}}$) // Get similarity matrix
$\delta \leftarrow \operatorname*{arg\,max}_{k \in \{0.01, \ldots, 1.00\}} \big\{\text{TPR}(\Delta,k) - (1 - \text{FPR}(\Delta,k))\big\}$ // Get similarity cutoff
$\mathcal{C} \leftarrow$ AssignLabels($\vec{\mathcal{R}}$, $\delta$) // Label the requirements
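The cutoff search and labeling steps of Algorithm 1 can be sketched as follows. Two choices in this sketch are assumptions rather than details taken from the text. First, each requirement is scored by its highest similarity to any other requirement, and it is labeled a conflict when that score reaches the cutoff. Second, the sketch maximizes Youden's J statistic, $\text{TPR} - \text{FPR}$, a common ROC cutoff criterion that is swapped in here for illustration and that differs from the objective as printed in Algorithm 1. All function names are hypothetical, and at least two requirements are assumed.

```python
def requirement_scores(similarity):
    """Score each requirement by its highest similarity to any *other* requirement."""
    n = len(similarity)
    return [max(similarity[i][j] for j in range(n) if j != i) for i in range(n)]


def rates(scores, labels, k):
    """TPR and FPR when requirements scoring at least k are labeled as conflicts."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= k and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= k and not y)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr


def select_cutoff(similarity, labels):
    """Grid search over k in {0.01, ..., 1.00}; keeps the first k with the best J."""
    scores = requirement_scores(similarity)
    best_k, best_j = None, float("-inf")
    for step in range(1, 101):
        k = step / 100
        tpr, fpr = rates(scores, labels, k)
        j = tpr - fpr  # Youden's J (assumed objective, see lead-in)
        if j > best_j:
            best_k, best_j = k, j
    return best_k


def assign_labels(similarity, delta):
    """True marks a candidate conflict."""
    return [s >= delta for s in requirement_scores(similarity)]


# Toy similarity matrix: requirements 0 and 1 are near-duplicates.
similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.4],
    [0.1, 0.2, 0.4, 1.0],
]
labels = [True, True, False, False]  # ground-truth conflict labels
delta = select_cutoff(similarity, labels)
candidates = assign_labels(similarity, delta)
```

Because the search needs ground-truth labels to compute the true and false positive rates, this step is what makes Phase I supervised.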

3.2.2 Phase II: Semantic-based Conflict Detection

Algorithm 2 describes the process of semantic conflict detection, as presented in the right panel of Figure 1. This algorithm serves as a second filter on the candidate conflicts generated in Phase I. Specifically, any candidate conflict $c \in \mathcal{C}$ is semantically compared against the top $m$ most similar requirements from $\mathcal{R}$. By focusing on only the $m$ most similar requirements, we reduce the computational burden and also make use of the cosine similarity between the requirements. This semantic comparison is based on the overlap ratio between the entities present in the requirements. For a given candidate conflict $c \in \mathcal{C}$, the overlap ratio is calculated as

vc=maxrc{Overlap(c,r)}UniqueEntities(c)subscript𝑣𝑐subscript𝑟subscript𝑐Overlap𝑐𝑟UniqueEntities𝑐\displaystyle v_{c}=\frac{\max_{r\in\mathcal{L}_{c}}\big{\{}\texttt{Overlap}(c% ,r)\big{\}}}{\texttt{UniqueEntities}(c)}italic_v start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT = divide start_ARG roman_max start_POSTSUBSCRIPT italic_r ∈ caligraphic_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT end_POSTSUBSCRIPT { Overlap ( italic_c , italic_r ) } end_ARG start_ARG UniqueEntities ( italic_c ) end_ARG (2)

where csubscript𝑐\mathcal{L}_{c}caligraphic_L start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT represents the set of m𝑚mitalic_m most similar requirements to candidate conflict c𝑐citalic_c. The function Overlap(c,r)Overlap𝑐𝑟\texttt{Overlap}(c,r)Overlap ( italic_c , italic_r ) calculates the number of overlap** entities between c𝑐citalic_c and r𝑟ritalic_r, and function UniqueEntities(c)UniqueEntities𝑐\texttt{UniqueEntities}(c)UniqueEntities ( italic_c ) calculates the number of unique entities in candidate conflict (i.e., a requirement text) c𝑐citalic_c. The calculated overlap ratio vcsubscript𝑣𝑐v_{c}italic_v start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT for c𝒞𝑐𝒞c\in\mathcal{C}italic_c ∈ caligraphic_C is then compared against a pre-determined overlap threshold value, Tosubscript𝑇𝑜T_{o}italic_T start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT, and c𝑐citalic_c is added to final conflict set 𝒞¯¯𝒞\bar{\mathcal{C}}over¯ start_ARG caligraphic_C end_ARG if vcTosubscript𝑣𝑐subscript𝑇𝑜v_{c}\geq T_{o}italic_v start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT ≥ italic_T start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT. In our analysis, we set m=5𝑚5m=5italic_m = 5 and To=1subscript𝑇𝑜1T_{o}=1italic_T start_POSTSUBSCRIPT italic_o end_POSTSUBSCRIPT = 1, which are determined based on preliminary experiments. In Algorithm 2, GetSimilarRequirements(c,,m𝑐𝑚c,\mathcal{R},mitalic_c , caligraphic_R , italic_m) returns the list \mathcal{L}caligraphic_L of m𝑚mitalic_m most similar requirements from \mathcal{R}caligraphic_R for candidate conflict c𝑐citalic_c, and GetMaxOverlapRatio(c,𝑐c,\mathcal{L}italic_c , caligraphic_L) returns the maximum value for the overlaps between requirements from set \mathcal{L}caligraphic_L and candidate conflict c𝑐citalic_c.

Algorithm 2 Semantic Conflict Detection

Require:
        Candidate conflict set: $\mathcal{C}=\{c_{1},c_{2},\ldots,c_{t}\}$
        Requirement set: $\mathcal{R}=\{r_{1},r_{2},\ldots,r_{n}\}$
        # of similar requirements: $m$
   
Output: Refined conflict set $\bar{\mathcal{C}}$
   
Initialization:
        $\bar{\mathcal{C}}=\emptyset$ // Initialize conflict set to an empty set
        $T_{o}=1$ // Set overlapping threshold as 1
        $m=5$ // Set number of similar requirements as 5
   
For $c\in\mathcal{C}$ do:
        $\mathcal{L}\leftarrow\texttt{GetSimilarRequirements}(c,\mathcal{R},m)$ // Get the $m$ similar requirements
        $v\leftarrow\texttt{GetMaxOverlapRatio}(c,\mathcal{L})$ // Calculate max. overlap ratio using Eqn. (2)
        If $v\geq T_{o}$: // Compare with threshold
            $\bar{\mathcal{C}}\leftarrow\bar{\mathcal{C}}\cup\{c\}$ // Augment the final conflict set
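A minimal Python sketch of this filtering loop follows. It assumes that requirements are referred to by index, that a precomputed similarity matrix `sim` is available from Phase I, and that each requirement already has a set of extracted entities; the helper names and the toy data are hypothetical and not taken from the original implementation.

```python
def overlap_ratio(c_entities, r_entities):
    """Eq. (2) numerator for one pair, divided by the unique entities of c."""
    if not c_entities:
        return 0.0
    return len(c_entities & r_entities) / len(c_entities)

def get_similar_requirements(c, sim, m):
    """Indices of the m requirements most similar to c (excluding c itself)."""
    others = [j for j in range(len(sim)) if j != c]
    others.sort(key=lambda j: sim[c][j], reverse=True)
    return others[:m]

def semantic_conflict_detection(candidates, sim, entities, m=5, t_o=1.0):
    """Keep a candidate only if its maximum overlap ratio reaches t_o."""
    refined = []
    for c in candidates:
        similar = get_similar_requirements(c, sim, m)
        v = max((overlap_ratio(entities[c], entities[r]) for r in similar),
                default=0.0)
        if v >= t_o:
            refined.append(c)
    return refined

# Hypothetical toy data: three requirements with hand-written entity sets.
entities = [
    {"uav", "flight range", "no less than", "20"},
    {"uav", "flight range", "no less than", "20", "miles"},
    {"pilot", "download", "flight plan"},
]
sim = [
    [1.00, 0.95, 0.30],
    [0.95, 1.00, 0.35],
    [0.30, 0.35, 1.00],
]
refined = semantic_conflict_detection([0, 2], sim, entities, m=1, t_o=1.0)
```

Because the ratio is normalized by the entities of the candidate only, it is not symmetric: in the toy data, requirement 0 is fully covered by requirement 1, while the reverse ratio is 4/5.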

To extract the entities from the requirements, we employ two NER techniques described below.

  • Part-of-Speech (POS) Tagging: According to [40], a software requirement should follow the structure Actor (Noun) + Object + Action (Verb) + Resource. Based on this structure, the generic NER method extracts the ‘Noun’ and ‘Verb’ tags from the requirements; we refer to this method as ‘POS’ tagging. We employ the POS tagger provided in the spaCy library in Python.

  • Software-specific Named Entity Recognition (S-NER): NER serves to extract relevant entities from input text, and its effectiveness can be enhanced by training machine learning models on domain-specific corpora. In our context, we leverage a software-specific NER system to extract entities crucial for understanding requirements, specifically focusing on actor, action, object, property, metric, and operator.

    In the context of requirements, an “Actor” denotes an entity interacting with the software application, while an “Action” represents an operation performed by an actor within the software system. To illustrate, consider the requirement “The UAV shall charge to 75% in less than 3 hours,” where UAV serves as the actor, and charge as the corresponding action. We employ transformer models trained on software-specific corpora to proficiently extract these entities from requirement pairs [41].
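To make the POS-based variant concrete, the sketch below filters an already tagged requirement down to its noun and verb tokens. The tags are hand-assigned Universal POS labels used purely for illustration, not the output of an actual tagger; in practice they would come from a POS tagger such as the one in spaCy. Keeping proper nouns (`PROPN`) alongside `NOUN` and `VERB` is an assumption of this sketch, and the function name is hypothetical.

```python
def pos_entities(tagged_tokens, keep=("NOUN", "PROPN", "VERB")):
    """Return the lower-cased tokens whose POS tag is in `keep`."""
    return {word.lower() for word, tag in tagged_tokens if tag in keep}

# Hand-tagged version of the example requirement (tags are illustrative).
tagged = [
    ("The", "DET"), ("UAV", "PROPN"), ("shall", "AUX"), ("charge", "VERB"),
    ("to", "ADP"), ("75", "NUM"), ("%", "NOUN"), ("in", "ADP"),
    ("less", "ADJ"), ("than", "ADP"), ("3", "NUM"), ("hours", "NOUN"),
]
entities = pos_entities(tagged)
```

The resulting set can be passed directly to an overlap computation such as Equation 2, since only set membership matters there.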

We also provide sample calculations in Table 6 to better illustrate the process in Algorithm 2. We show the overlapping software-specific entities present in the candidate requirement ($c$) and each similar requirement ($r\in\mathcal{L}$) with different color codes. For instance, the entity ‘UAV’ is represented by blue color. The requirements $r_{1}$ and $r_{2}$ both return high overlap ratios, indicating a conflict with $c$.

Table 6: Sample candidate requirement ($c$) and set of similar requirements ($\mathcal{L}$) to calculate the overlapping ratio ($v$). The maximum count of overlap is 7, which results in an overlap ratio value $v$ of 1.

| Candidate Requirement | Similar Requirements | Overlap$(c,r)$ | Ratio ($v$) |
|---|---|---|---|
| $c$ = ‘The UAV flight range shall be no less than 20 kilometers.’ | $r_{1}$ = ‘The UAV flight range shall be no less than 20 miles.’ | 7 | 1.00 |
| | $r_{2}$ = ‘The UAV flight range shall be a minimum of 20 miles.’ | 5 | 0.71 |
| | $r_{3}$ = ‘The UAV shall be able to autonomously a flight plan consisting of a set of waypoints within its range and flight capabilities.’ | 2 | 0.28 |
| | $r_{4}$ = ‘The UAV shall be able to autonomously plan a flight consisting of a set of waypoints within its range and flight capabilities.’ | 2 | 0.28 |
| | $r_{5}$ = ‘The Pilot controller shall be able to download a flight plan from the Hummingbird consisting of a set of waypoints within the flight range of the Hummingbird.’ | 2 | 0.28 |
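The ratios in Table 6 can be reproduced from the overlap counts with a few lines of Python. The denominator of 7 unique entities in $c$ is inferred from the caption (an overlap of 7 yields a ratio of 1); truncating to two decimals, rather than rounding, is an observation of this sketch that happens to match the reported 0.28.

```python
import math

# Overlap counts per similar requirement, copied from Table 6.
overlaps = {"r1": 7, "r2": 5, "r3": 2, "r4": 2, "r5": 2}
unique_entities_c = 7  # inferred from the caption, not stated explicitly

def truncate(x, digits=2):
    """Cut x to the given number of decimals without rounding up."""
    factor = 10 ** digits
    return math.floor(x * factor) / factor

ratios = {r: truncate(n / unique_entities_c) for r, n in overlaps.items()}
v = max(ratios.values())   # maximum overlap ratio, Eq. (2)
keep_candidate = v >= 1.0  # comparison against T_o = 1
```

With $T_{o}=1$, the candidate is retained only because of $r_{1}$; lowering the threshold to 0.75 would still not admit $r_{2}$, whose ratio is 5/7.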

3.3 UnSupCDA: Unsupervised Conflict Detection Algorithm

We devise an unsupervised variant of the S3CDA approach to alleviate the need for labeled conflict data in the task of identifying conflicts within SRS documents. This unsupervised version seamlessly integrates key components from both phases of the original S3CDA approach, adapting them to function without the reliance on labeled data. By combining the strengths of the two phases, our unsupervised approach retains the efficiency and effectiveness of conflict identification while eliminating the necessity for manual annotation.

In Algorithm 3, we first transform the requirements into embedding vectors using the SentenceEmbedding function, capturing their semantic content. Subsequently, a pairwise cosine similarity matrix ($\Delta$) is computed using the PairwiseDistance function. For each requirement in $\mathcal{R}$, we identify the $n$ most similar requirements and calculate the entity overlap ratio using Equation 2. We then apply a predefined threshold ($T_{o}$) to determine the conflicts.

Algorithm 3 UnSupCDA

Require:
        Requirement set: $\mathcal{R}=\{r_{1},r_{2},\ldots,r_{n}\}$
   
Output: Conflict set $\bar{\mathcal{C}}$
   
Initialization:
        $\bar{\mathcal{C}}=\emptyset$ // Initialize conflict set to an empty set
        $T_{o}=\{1.0,0.75,0.5\}$ // Overlapping threshold set
        $n=\{1,2,3\}$ // n highest similarity requirement
    $\vec{\mathcal{R}}\leftarrow\texttt{SentenceEmbedding}(\mathcal{R})$ // Generate requirement vectors
    $\Delta\leftarrow\texttt{PairwiseDistance}(\vec{\mathcal{R}})$ // Get similarity matrix
   
For $r_{i}\in\mathcal{R}$ do:
        $(r_{i},r_{j})\leftarrow\texttt{n\_HighestSimilarityPairs}(\Delta,n)$
        $v\leftarrow\texttt{GetMaxOverlapRatio}(r_{i},r_{j})$ // max. overlap ratio using Eqn. (2)
        If $v\geq T_{o}$:
            $\bar{\mathcal{C}}\leftarrow\bar{\mathcal{C}}\cup\{r_{i},r_{j}\}$
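The loop of Algorithm 3 can be sketched in Python as follows. The sketch assumes a precomputed similarity matrix and per-requirement entity sets (both hypothetical toy data here), and it resolves one ambiguity by assumption: when $n>1$, every neighbour whose overlap ratio reaches the threshold is added together with $r_{i}$. The function name is illustrative only.

```python
def unsup_cda(sim, entities, n=1, t_o=0.75):
    """Flag requirement pairs whose entity overlap ratio reaches t_o."""
    conflicts = set()
    for i in range(len(sim)):
        # n requirements most similar to r_i, excluding r_i itself
        neighbours = sorted(
            (j for j in range(len(sim)) if j != i),
            key=lambda j: sim[i][j],
            reverse=True,
        )[:n]
        for j in neighbours:
            base = entities[i]
            ratio = len(base & entities[j]) / len(base) if base else 0.0
            if ratio >= t_o:
                conflicts.update((i, j))
    return sorted(conflicts)

# Hypothetical toy data: four requirements, no labels needed.
entities = [
    {"uav", "range", "20"},
    {"uav", "range", "20", "miles"},
    {"pilot", "plan"},
    {"operator", "log"},
]
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.4],
    [0.1, 0.1, 0.4, 1.0],
]
# Sweep the same grid of n and T_o values listed in the initialization.
results = {
    (n, t): unsup_cda(sim, entities, n=n, t_o=t)
    for n in (1, 2, 3)
    for t in (1.0, 0.75, 0.5)
}
```

No cutoff on cosine similarity is learned here, which is what removes the dependence on labeled data; the similarity matrix is used only to rank neighbours.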

3.3.1 Overlapping Entity Ratio vs. Cosine Similarity

We analyze the correlation between the overlapping entity ratio, as defined in Equation 2, and cosine similarity through scatter plot visualizations. The requirements from each dataset are embedded using the SBERT model, and pairwise cosine similarity is computed. Subsequently, the most similar requirement is identified for each requirement in the dataset. To calculate the entity overlapping ratio, we employ POS tagging and the S-NER method for entity extraction. The overlapping ratio is then determined based on the extracted entities. Our objective is to investigate whether there exists a relationship between the cosine similarity of requirements and their entity overlap ratio, as both metrics play crucial roles in our proposed algorithms.
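A numerical complement to the scatter plots is a correlation coefficient over the (cosine similarity, overlap ratio) pairs. The sketch below computes Pearson's r in plain Python; choosing Pearson's r is an assumption of this example, since the analysis in the text relies on visual inspection, and the sample values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")  # undefined when one variable is constant
    return cov / (sd_x * sd_y)

# Hypothetical values: one point per requirement and its nearest neighbour.
cosine_similarity = [0.95, 0.80, 0.60, 0.40]
overlap_ratio = [1.00, 0.75, 0.50, 0.50]
r = pearson(cosine_similarity, overlap_ratio)
```

A value near zero would support the reading that the two measures carry complementary information, while a value near one would suggest that the second filter adds little beyond the first.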

Examining Figure 2, we discern an absence of a definitive correlation. Notably, in the case of the WorldVista dataset, Figures 2a and 2b reveal a weak positive correlation, where high values of cosine similarity align with high overlapping ratios. This trend is consistent across other datasets. In the OpenCoss dataset, Figure 3b illustrates that most requirements exhibit both high cosine similarity and overlapping ratios, posing challenges for our conflict detection algorithms and elucidating the observed performance metrics. For other datasets, the scatter plots are presented in Appendix 7.1.

Figure 2: Scatterplot depicting the relationship between Cosine Similarity and Overlapping Entity Ratio among WorldVista requirements. Panels: (a) POS Tagging, (b) S-NER.

Figure 3: Scatterplot depicting the relationship between Cosine Similarity and Overlapping Entity Ratio among OpenCoss requirements. Panels: (a) POS Tagging, (b) S-NER.

3.4 Experimental Setup

To evaluate our proposed approaches, we perform 3-fold cross-validation over all the requirement datasets. Given the distribution of conflicting requirements and the limited number of requirements, we divide each dataset into three folds. Each fold includes both conflicting and non-conflicting requirements, and we ensure that every conflicting requirement in a fold has its conflict pair in the same fold. For our techniques, we use the training set to determine the cosine similarity cut-off value and apply this value to the corresponding test set. For S3CDA, we use standard classification metrics, namely the macro-averaged F1-score, Precision, and Recall, for performance evaluation. Table 7 reports the hyperparameters and model configurations for the models used in our proposed approaches.

Table 7: Model configurations and hyperparameters used in the proposed approach.

| Model | Checkpoint/library | Model configurations/hyperparameters |
|---|---|---|
| SBERT | all-MiniLM-L12-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2) | 12-layer, 384-hidden, 12-heads, 33M parameters |
| SBERT | all-mpnet-base-v2 (https://huggingface.co/sentence-transformers/all-mpnet-base-v2) | 12-layer, 768-hidden, 12-heads, 110M parameters |
| USE | TensorFlow 2.0 (v4) | 512-dimensional vector |
| TFIDF | scikit-learn (sklearn) (https://scikit-learn.org/stable/) | n_gram = (1,4), 1024-dimensional vector |
| S-NER | Bert-base-cased [41] | 12-layer, 768-hidden, 12-heads, 110M parameters |
| POS tagging | spaCy (https://spacy.io/) | model “en_core_web_sm” |
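The pair-preserving fold construction described for the cross-validation setup can be sketched as below. The round-robin assignment is an assumption of this example, since the text only states that a conflict and its pair must share a fold; the function name and requirement identifiers are hypothetical.

```python
def make_folds(non_conflicts, conflict_pairs, k=3):
    """Split requirements into k folds, keeping each conflict pair together."""
    folds = [[] for _ in range(k)]
    # Assign whole pairs first so that no pair is ever split across folds.
    for idx, pair in enumerate(conflict_pairs):
        folds[idx % k].extend(pair)
    # Spread the non-conflicting requirements over the folds.
    for idx, req in enumerate(non_conflicts):
        folds[idx % k].append(req)
    return folds

# Hypothetical identifiers: four conflict pairs and five other requirements.
pairs = [("a1", "a2"), ("b1", "b2"), ("c1", "c2"), ("d1", "d2")]
others = ["n1", "n2", "n3", "n4", "n5"]
folds = make_folds(others, pairs, k=3)
```

Each fold then serves once as the test set, with the cosine similarity cut-off determined on the remaining two folds.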

In the case of UnSupCDA, being an unsupervised task, we adopt the evaluation metrics introduced by Guo et al. [10] to assess the accuracy of conflict identification. The metrics include Precision and Recall, calculated as follows:

$$\texttt{Precision}=\frac{\textit{Correctly Detected}}{\textit{Overall Detected Conflicts}},\quad\texttt{Recall}=\frac{\textit{Correctly Detected}}{\textit{Overall Known Conflicts}}$$

In these calculations, “Correctly Detected” conflicts, also known as True Positives (TP), are the instances where the algorithm correctly identifies conflicting requirements as conflicts. “Overall Detected Conflicts” is the total number of requirements that the algorithm flags as conflicts, regardless of whether the flag is correct, i.e., the sum of True Positives and False Positives (TP + FP). “Overall Known Conflicts” is the number of conflicts labeled by experts as true conflicts.
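These two definitions translate directly into set operations. The following sketch assumes that detected and known conflicts are given as collections of requirement identifiers; the identifiers are hypothetical.

```python
def precision_recall(detected, known):
    """Precision and Recall from detected and expert-labeled conflict sets."""
    detected, known = set(detected), set(known)
    correct = detected & known  # correctly detected conflicts (TP)
    precision = len(correct) / len(detected) if detected else 0.0
    recall = len(correct) / len(known) if known else 0.0
    return precision, recall

# Hypothetical example: four flagged requirements, three known conflicts.
detected = {"r1", "r2", "r3", "r4"}
known = {"r1", "r2", "r5"}
precision, recall = precision_recall(detected, known)
```

Here two of the four flagged requirements are true conflicts (Precision 0.5) and two of the three known conflicts are found (Recall 2/3).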

For entity extraction, we utilise the transformer model checkpoint trained in Malik et al. [41]. We directly apply the best model, Bert-base-cased, to our software requirements to extract the software-specific entities.

4 Results

In this section, we first assess the performance of the supervised conflict detection algorithm (S3CDA), and present the results from our comparative analysis with various sentence embeddings for each of the requirement datasets. Next, we present the performance of our unsupervised conflict detection algorithm.

4.1 S3CDA Performance Evaluation

In Phase I of S3CDA, after computing the similarity matrix between the requirements, we generate ROC curves, which are used to obtain the cosine similarity cut-off for conflict detection based on TPR and FPR values. Figure 4 shows the ROC curves for the UAV dataset with USE embedding in each fold. Similar ROC curves for the other requirement sets, generated with their best sentence embeddings, are provided in the appendix (see Section 7).

Figure 4: ROC curves for UAV dataset with USE embedding across 3 folds. Panels: (a) Fold 1, (b) Fold 2, (c) Fold 3.

Table 8 presents a comparison of various sentence embeddings in Phase I of the S3CDA approach. The performance across different embeddings exhibits no discernible pattern, with TFIDF performing consistently at an average level. Notably, for PURE and IBM-UAV datasets, SBERT achieves F1-scores of 89.6% and 71.1%, respectively. In the case of UAV, the USE embedding attains the highest F1-score of 92.3%, while for WorldVista, SBERT–TFIDF achieves an F1-score of 87.1%. Surprisingly, the OpenCoss dataset demonstrates the highest F1-score (57.0%) with TFIDF embeddings. This anomaly may be attributed to the challenging nature of the OpenCoss dataset, characterized by a complex word structure and a substantial overlap in vocabulary among requirements. This overlap proves advantageous for frequency-based TFIDF embeddings in capturing relevant patterns.

Table 8: Results for various sentence embeddings with requirement datasets are reported using macro-averaged metrics - Precision (P), Recall (R), and F1-score (F1) expressed as “mean ± std” across three folds. The evaluation also includes the number of Potential and Correctly Detected conflicts in comparison to known conflicts for each dataset. The best embedding for each dataset is highlighted in bold. Here, $E_{1}$: TFIDF, $E_{2}$: USE, $E_{3}$: SBERT, $E_{4}$: SBERT–TFIDF.

| Dataset | Embeddings | Cosine Cutoff | P | R | F1 | Potential Conflicts | Correctly Detected |
|---|---|---|---|---|---|---|---|
| OpenCoss | $\mathbf{E_{1}}$ | 0.700 | 0.433 ± 0.118 | 0.847 ± 0.137 | 0.570 ± 0.130 | 41 / 20 | 17 / 20 |
| | $E_{2}$ | 0.936 | 0.366 ± 0.059 | 0.652 ± 0.272 | 0.461 ± 0.119 | 34 / 20 | 13 / 20 |
| | $E_{3}$ | 0.956 | 0.354 ± 0.041 | 0.805 ± 0.141 | 0.484 ± 0.026 | 46 / 20 | 16 / 20 |
| | $E_{4}$ | 0.826 | 0.373 ± 0.161 | 0.722 ± 0.207 | 0.487 ± 0.185 | 41 / 20 | 14 / 20 |
| WorldVista | $E_{1}$ | 0.206 | 0.812 ± 0.093 | 0.853 ± 0.112 | 0.821 ± 0.036 | 76 / 70 | 60 / 70 |
| | $E_{2}$ | 0.663 | 0.897 ± 0.074 | 0.839 ± 0.116 | 0.857 ± 0.034 | 67 / 70 | 59 / 70 |
| | $E_{3}$ | 0.766 | 0.856 ± 0.112 | 0.875 ± 0.102 | 0.855 ± 0.050 | 73 / 70 | 61 / 70 |
| | $\mathbf{E_{4}}$ | 0.510 | 0.898 ± 0.087 | 0.856 ± 0.043 | 0.871 ± 0.022 | 68 / 70 | 60 / 70 |
| UAV | $E_{1}$ | 0.363 | 0.852 ± 0.051 | 0.944 ± 0.078 | 0.894 ± 0.051 | 40 / 36 | 34 / 36 |
| | $\mathbf{E_{2}}$ | 0.770 | 0.860 ± 0.050 | 1.000 ± 0.000 | 0.923 ± 0.029 | 42 / 36 | 36 / 36 |
| | $E_{3}$ | 0.883 | 0.918 ± 0.068 | 0.833 ± 0.136 | 0.864 ± 0.068 | 31 / 36 | 30 / 36 |
| | $E_{4}$ | 0.616 | 0.893 ± 0.042 | 0.944 ± 0.048 | 0.917 ± 0.059 | 38 / 36 | 34 / 36 |
| PURE | $E_{1}$ | 0.816 | 0.910 ± 0.063 | 0.785 ± 0.210 | 0.819 ± 0.112 | 36 / 40 | 32 / 40 |
| | $E_{2}$ | 0.816 | 0.893 ± 0.076 | 0.785 ± 0.210 | 0.809 ± 0.102 | 37 / 40 | 32 / 40 |
| | $\mathbf{E_{3}}$ | 0.883 | 0.902 ± 0.070 | 0.896 ± 0.073 | 0.896 ± 0.044 | 40 / 40 | 36 / 40 |
| | $E_{4}$ | 0.710 | 0.958 ± 0.058 | 0.785 ± 0.210 | 0.841 ± 0.123 | 34 / 40 | 32 / 40 |
| IBM-UAV | $E_{1}$ | 0.583 | 0.708 ± 0.212 | 0.550 ± 0.324 | 0.557 ± 0.178 | 24 / 28 | 16 / 28 |
| | $E_{2}$ | 0.836 | 0.704 ± 0.163 | 0.716 ± 0.332 | 0.688 ± 0.252 | 28 / 28 | 21 / 28 |
| | $\mathbf{E_{3}}$ | 0.899 | 0.871 ± 0.118 | 0.716 ± 0.332 | 0.711 ± 0.221 | 26 / 28 | 21 / 28 |
| | $E_{4}$ | 0.730 | 0.529 ± 0.380 | 0.566 ± 0.418 | 0.537 ± 0.380 | 22 / 28 | 17 / 28 |

In the subsequent step of the S3CDA approach, we validate the potential conflicts identified in Phase I. This validation involves the application of entity extraction techniques, specifically POS tagging and S-NER, to the potential conflicts associated with the best embedding identified in Table 8. Notably, Phase II of the S3CDA approach is employed to validate conflicts through semantic analysis, focusing on the overlapping entities within potential conflicts.

Table 9 summarizes the results of this validation. Notably, the S-NER method demonstrates relatively consistent performance and effectively validates Correctly Detected Conflicts. Across both techniques, there is a trend of improved Precision scores accompanied by a decrease in Recall, resulting in an overall decline in F1-scores. Additionally, the use of a hard threshold ($T_{o}=1$) leads to a reduction in Recall scores, while a more flexible threshold ($T_{o}=0.75$) shows consistent or equivalent performance to Phase I for various metrics.

Table 9: Comparison of POS tagging and S-NER entity extraction techniques with $T_{o}=\{1.0,0.75\}$. Performance changes in Precision and Recall are listed in the Δ rows, with ↑ and ↓ indicating increments and decrements, respectively, and ∼ indicating no change, with both absolute and relative percentage values. The baseline for comparison is established using the best embedding performance as identified in Table 8.

| Dataset | Method | P ($T_{o}=1$) | R ($T_{o}=1$) | F1 ($T_{o}=1$) | Correctly Detected ($T_{o}=1$) | P ($T_{o}=0.75$) | R ($T_{o}=0.75$) | F1 ($T_{o}=0.75$) | Correctly Detected ($T_{o}=0.75$) |
|---|---|---|---|---|---|---|---|---|---|
| OpenCoss | POS | 0.419 ± 0.097 | 0.791 ± 0.090 | 0.543 ± 0.092 | 16 / 20 | 0.433 ± 0.118 | 0.847 ± 0.137 | 0.570 ± 0.130 | 17 / 20 |
| | Δ | ↓ -0.014 / 3.23% | ↓ -0.056 / 6.611% | | | ∼ | ∼ | | |
| | S-NER | 0.433 ± 0.118 | 0.847 ± 0.137 | 0.570 ± 0.130 | 17 / 20 | 0.433 ± 0.118 | 0.847 ± 0.137 | 0.570 ± 0.130 | 17 / 20 |
| | Δ | ∼ | ∼ | | | ∼ | ∼ | | |
| WorldVista | POS | 0.925 ± 0.104 | 0.327 ± 0.063 | 0.480 ± 0.076 | 23 / 70 | 0.922 ± 0.078 | 0.825 ± 0.077 | 0.864 ± 0.019 | 58 / 70 |
| | Δ | ↑ +0.027 / 3.00% | ↓ -0.548 / 62.62% | | | ↑ +0.024 / 2.67% | ↓ -0.050 / 5.71% | | |
| | S-NER | 0.904 ± 0.098 | 0.756 ± 0.055 | 0.817 ± 0.018 | 53 / 70 | 0.898 ± 0.087 | 0.856 ± 0.043 | 0.871 ± 0.022 | 60 / 70 |
| | Δ | ↑ +0.006 / 0.66% | ↓ -0.119 / 23.60% | | | ∼ | ∼ | | |
| UAV | POS | 0.925 ± 0.104 | 0.361 ± 0.171 | 0.484 ± 0.155 | 13 / 36 | 0.855 ± 0.044 | 0.944 ± 0.078 | 0.893 ± 0.022 | 34 / 36 |
| | Δ | ↑ +0.065 / 7.55% | ↓ -0.639 / 63.90% | | | ↓ -0.005 / 5.81% | ↓ -0.056 / 5.60% | | |
| | S-NER | 0.867 ± 0.039 | 0.916 ± 0.068 | 0.891 ± 0.052 | 33 / 36 | 0.860 ± 0.050 | 1.000 ± 0.000 | 0.923 ± 0.029 | 36 / 36 |
| | Δ | ↑ +0.007 / 0.81% | ↓ -0.084 / 8.40% | | | ∼ | ∼ | | |
| PURE | POS | 0.850 ± 0.108 | 0.603 ± 0.124 | 0.702 ± 0.118 | 24 / 40 | 0.897 ± 0.075 | 0.869 ± 0.102 | 0.879 ± 0.068 | 35 / 40 |
| | Δ | ↓ -0.052 / 5.76% | ↓ -0.293 / 32.70% | | | ↑ +0.001 / 0.11% | ↓ -0.027 / 3.01% | | |
| | S-NER | 0.891 ± 0.078 | 0.750 ± 0.087 | 0.807 ± 0.035 | 30 / 40 | 0.902 ± 0.070 | 0.896 ± 0.073 | 0.896 ± 0.044 | 36 / 40 |
| | Δ | ↓ -0.011 / 1.21% | ↓ -0.146 / 16.29% | | | ↑ +0.006 / 0.66% | ∼ | | |
| IBM-UAV | POS | 0.909 ± 0.128 | 0.583 ± 0.239 | 0.661 ± 0.186 | 17 / 28 | 0.864 ± 0.128 | 0.683 ± 0.306 | 0.694 ± 0.213 | 20 / 28 |
| | Δ | ↑ +0.038 / 4.36% | ↓ -0.133 / 18.57% | | | ↓ -0.007 / 0.80% | ↓ -0.033 / 4.60% | | |
| | S-NER | 0.914 ± 0.068 | 0.583 ± 0.311 | 0.659 ± 0.226 | 17 / 28 | 0.871 ± 0.118 | 0.716 ± 0.332 | 0.711 ± 0.221 | 21 / 28 |
| | Δ | ↑ +0.043 / 4.93% | ↓ -0.133 / 18.57% | | | ∼ | ∼ | | |

4.2 UnSupCDA Performance Evaluation

Table 10 outlines the assessment outcomes for the UnSupCDA approach. Overall, the approach achieves generally high Recall values across all datasets, indicating its proficiency in capturing conflicts. However, Precision is considerably lower, leading to lower F1-scores: the algorithm identifies the true conflicts but also flags many false candidates, which reduces Precision.

Table 10: Results with the UnSupCDA for $n=\{1,2,3\}$ with various thresholds. Reported values are averaged over 3 folds and presented as “mean ± std”.

| Dataset | P ($T_{o}=1.0$) | R ($T_{o}=1.0$) | F1 ($T_{o}=1.0$) | P ($T_{o}=0.75$) | R ($T_{o}=0.75$) | F1 ($T_{o}=0.75$) | P ($T_{o}=0.5$) | R ($T_{o}=0.5$) | F1 ($T_{o}=0.5$) |
|---|---|---|---|---|---|---|---|---|---|
| OpenCoss | 0.178 ± 0.023 | 1.000 ± 0.000 | 0.302 ± 0.032 | 0.170 ± 0.020 | 1.000 ± 0.000 | 0.290 ± 0.030 | 0.170 ± 0.020 | 1.000 ± 0.000 | 0.290 ± 0.030 |
| WorldVista | 0.511 ± 0.055 | 0.656 ± 0.136 | 0.565 ± 0.056 | 0.510 ± 0.029 | 0.972 ± 0.039 | 0.667 ± 0.015 | 0.472 ± 0.010 | 1.000 ± 0.000 | 0.641 ± 0.009 |
| UAV | 0.330 ± 0.019 | 0.805 ± 0.039 | 0.468 ± 0.024 | 0.336 ± 0.004 | 1.000 ± 0.000 | 0.503 ± 0.005 | 0.310 ± 0.003 | 1.000 ± 0.000 | 0.473 ± 0.004 |
| PURE | 0.752 ± 0.110 | 0.821 ± 0.050 | 0.783 ± 0.083 | 0.741 ± 0.126 | 1.000 ± 0.000 | 0.845 ± 0.086 | 0.605 ± 0.026 | 1.000 ± 0.000 | 0.753 ± 0.020 |
| IBM-UAV | 0.458 ± 0.097 | 0.700 ± 0.141 | 0.553 ± 0.112 | 0.380 ± 0.077 | 0.766 ± 0.205 | 0.508 ± 0.115 | 0.291 ± 0.008 | 1.000 ± 0.000 | 0.451 ± 0.009 |

(a) $\mathbf{n=1}$

| Dataset | P ($T_{o}=1.0$) | R ($T_{o}=1.0$) | F1 ($T_{o}=1.0$) | P ($T_{o}=0.75$) | R ($T_{o}=0.75$) | F1 ($T_{o}=0.75$) | P ($T_{o}=0.5$) | R ($T_{o}=0.5$) | F1 ($T_{o}=0.5$) |
|---|---|---|---|---|---|---|---|---|---|
| OpenCoss | 0.176 ± 0.020 | 1.000 ± 0.000 | 0.299 ± 0.029 | 0.170 ± 0.020 | 1.000 ± 0.000 | 0.290 ± 0.030 | 0.170 ± 0.020 | 1.000 ± 0.000 | 0.290 ± 0.03 |
| WorldVista | 0.500 ± 0.000 | 0.772 ± 0.107 | 0.604 ± 0.034 | 0.493 ± 0.013 | 0.972 ± 0.039 | 0.654 ± 0.003 | 0.472 ± 0.010 | 1.000 ± 0.000 | 0.641 ± 0.009 |
| UAV | 0.329 ± 0.020 | 0.833 ± 0.068 | 0.472 ± 0.031 | 0.315 ± 0.000 | 1.000 ± 0.000 | 0.480 ± 0.000 | 0.310 ± 0.003 | 1.000 ± 0.000 | 0.473 ± 0.004 |
| PURE | 0.748 ± 0.115 | 0.849 ± 0.011 | 0.791 ± 0.072 | 0.677 ± 0.106 | 1.000 ± 0.000 | 0.802 ± 0.073 | 0.605 ± 0.026 | 1.000 ± 0.000 | 0.753 ± 0.020 |
| IBM-UAV | 0.413 ± 0.101 | 0.700 ± 0.141 | 0.518 ± 0.115 | 0.325 ± 0.086 | 0.766 ± 0.205 | 0.456 ± 0.121 | 0.279 ± 0.022 | 1.000 ± 0.000 | 0.436 ± 0.027 |

(b) $\mathbf{n=2}$

| Dataset | P ($T_{o}=1.0$) | R ($T_{o}=1.0$) | F1 ($T_{o}=1.0$) | P ($T_{o}=0.75$) | R ($T_{o}=0.75$) | F1 ($T_{o}=0.75$) | P ($T_{o}=0.5$) | R ($T_{o}=0.5$) | F1 ($T_{o}=0.5$) |
|---|---|---|---|---|---|---|---|---|---|
| OpenCoss | 0.175 ± 0.021 | 1.000 ± 0.000 | 0.297 ± 0.031 | 0.170 ± 0.020 | 1.000 ± 0.000 | 0.290 ± 0.030 | 0.170 ± 0.020 | 1.000 ± 0.000 | 0.290 ± 0.030 |
| WorldVista | 0.510 ± 0.014 | 0.800 ± 0.069 | 0.621 ± 0.011 | 0.489 ± 0.008 | 0.972 ± 0.039 | 0.650 ± 0.004 | 0.472 ± 0.010 | 1.000 ± 0.000 | 0.641 ± 0.009 |
| UAV | 0.318 ± 0.020 | 0.833 ± 0.068 | 0.461 ± 0.030 | 0.313 ± 0.003 | 1.000 ± 0.000 | 0.476 ± 0.004 | 0.310 ± 0.003 | 1.000 ± 0.000 | 0.473 ± 0.004 |
| PURE | 0.712 ± 0.090 | 0.849 ± 0.011 | 0.772 ± 0.059 | 0.661 ± 0.086 | 1.000 ± 0.000 | 0.793 ± 0.061 | 0.605 ± 0.026 | 1.000 ± 0.000 | 0.753 ± 0.020 |
| IBM-UAV | 0.413 ± 0.101 | 0.700 ± 0.141 | 0.518 ± 0.115 | 0.322 ± 0.091 | 0.799 ± 0.216 | 0.458 ± 0.126 | 0.279 ± 0.022 | 1.000 ± 0.000 | 0.436 ± 0.027 |

(c) $\mathbf{n=3}$

As anticipated, a hard threshold ($T_o = 1$) misses true conflicts: Recall is lowest at this setting on most datasets (for example, 0.772 on WorldVista and 0.700 on IBM-UAV for $n = 2$), while Precision remains low in absolute terms. A stable threshold ($T_o = 0.75$) raises Recall substantially relative to $T_o = 1$ and improves the F1-score on WorldVista, UAV, and PURE, at some cost in Precision. A soft threshold ($T_o = 0.5$) selects numerous false conflicts, which further diminishes Precision. This pattern is consistent across all $n$ values.

Contrary to expectations, increasing $n$ from 1 to 3 does not yield enhanced performance scores. The assumption was that a higher $n$ would improve scores, since $n$ represents the number of similar requirements considered for conflict assessment. However, the analysis reveals comparable performance across different thresholds for each $n$. For instance, in WorldVista, with $T_o = 0.75$ and $n \in \{1, 2, 3\}$, Recall remains nearly 100%, while F1-scores decrease (66.7%, 65.4%, and 65.0%, respectively). Similar trends are observed in the UAV and PURE datasets, where Recall is consistently 100%, but F1-scores decline with increasing $n$. Notably, OpenCoss displays the lowest F1-score and Precision while maintaining 100% Recall, resembling the performance observed with the S3CDA approach.
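The Precision/Recall trade-off discussed above can be illustrated with a minimal sketch. It assumes, as a simplification that is not spelled out in this section, that a requirement is flagged as a conflict when its overlapping entity ratio is at least $T_o$. The requirement IDs, ratios, and labels below are hypothetical toy values, not data from the paper.

```python
from typing import Dict, Set, Tuple


def flag_conflicts(overlap_ratio: Dict[str, float], t_o: float) -> Set[str]:
    """Flag every requirement whose overlap ratio reaches the threshold.

    Assumption: 'ratio >= T_o' is our reading of the thresholding step,
    not a rule quoted from the paper.
    """
    return {req_id for req_id, ratio in overlap_ratio.items() if ratio >= t_o}


def precision_recall_f1(predicted: Set[str], actual: Set[str]) -> Tuple[float, float, float]:
    """Standard set-based Precision, Recall, and F1 for the conflict class."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: overlap ratio per requirement and the true conflicts.
TOY_RATIOS = {"R1": 1.0, "R2": 0.8, "R3": 0.75, "R4": 0.6, "R5": 0.5, "R6": 0.2}
TOY_CONFLICTS = {"R1", "R3"}

if __name__ == "__main__":
    for t_o in (1.0, 0.75, 0.5):
        p, r, f1 = precision_recall_f1(flag_conflicts(TOY_RATIOS, t_o), TOY_CONFLICTS)
        print(f"T_o={t_o}: P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

On this toy input, lowering the threshold can only add flagged requirements, so Recall never decreases while Precision tends to fall, which mirrors the direction of the trend in the tables.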

4.3 Discussion

Table 11 outlines key comparisons between the proposed techniques. S3CDA exhibits superior overall performance, especially evident in conflict F1-score assessment. On the other hand, UnSupCDA excels in capturing true conflicts without the need for labeled training data, showcasing higher Recall. The two techniques each have distinct advantages and are applicable in different scenarios. Notably, UnSupCDA boasts versatility, making it easily applicable to SRS documents across various domains. In contrast, S3CDA’s performance is influenced by the specific characteristics, structure, and word usage within the requirements.

Table 11: Comparison of the proposed techniques
Aspect | S3CDA | UnSupCDA
Applicability | Well-suited for scenarios with reliable labeled data | Applicable when labeled data for conflicts is scarce or unavailable
Dataset | Requires labeled data | Requires known conflicts for evaluation
Requirements | More efficient with functional requirements | Works with both functional and non-functional requirements
Human Expert | Low false predictions, limited human expertise required | Generates false predictions, demands additional software expertise
Complexity | Higher, as it involves 2 stages | Lower
Experiment Results | Balanced Precision and Recall, i.e., higher F1-scores | Higher Recall and lower Precision, i.e., lower F1-scores

5 Threats to Validity

Several potential threats to the validity of our proposed conflict detection approaches in software requirements merit consideration. Firstly, the reliance on manual labeling and the introduction of synthetic conflicts may impact the generalizability of our findings. The effectiveness of our method could be influenced by the specifics introduced during manual labeling and the nature of synthetic conflicts, potentially limiting the broader applicability of the model. Additionally, the accuracy of the software-specific NER model poses a validity threat, as its performance directly influences the identification of entities within software requirements. Any inaccuracies in entity extraction may lead to false positives or negatives in conflict detection.

Furthermore, the validity of our approach hinges on the efficacy of the sentence embedding models in accurately embedding software requirements. Variations in the structure, vocabulary, or complexity of requirements could impact the algorithm's ability to discern meaningful similarities, affecting the overall success of conflict detection.

6 Conclusion

This study seeks to identify conflicts within software engineering requirements, recognizing their potential to significantly impede project success by causing delays throughout the entire development process. Prior research exhibits limitations in terms of generalizability, comprehensive requirement datasets, and clearly defined NLP-based automated methodologies. We develop a two-phase supervised (S3CDA) process for automatic conflict detection from SRS documents and an unsupervised variant (UnSupCDA) which works directly on natural language-based requirements.

Our experimental design aims to evaluate the performance of the S3CDA and UnSupCDA methodologies using five distinct SRS documents. The results demonstrate the effective conflict detection capabilities of both approaches across four datasets, with OpenCoss being an exception due to elevated structural similarities among requirements. S3CDA exhibits balanced F1-scores, while UnSupCDA attains 100% Recall across all datasets, albeit encountering challenges in Precision scores.

In the future, we plan to extend this study by introducing more diverse SRS documents to further validate our proposed approach. We also recognize that the NLP domain is highly dynamic and that new methods (e.g., embeddings) are developed at a fast pace. In this regard, we aim to extend our analysis by using other transformer-based sentence embeddings. Similarly, transformer-based NER models can be explored to improve the entity extraction performance.

Statements and Declarations

No potential conflict of interest was reported by the authors.

Data Availability Statement

The full research data can be accessed through the following link:
https://gitfront.io/r/user-9946871/tiYp38xXphd7/Paper-1-Req-Conflict/

References

  • Shah and Jinwala [2015] U. S. Shah, D. C. Jinwala, Resolving ambiguities in natural language software requirements: a comprehensive survey, ACM SIGSOFT Software Engineering Notes 40 (2015) 1–7.
  • Osman and Zaharin [2018] M. H. Osman, M. F. Zaharin, Ambiguous software requirement specification detection: An automated approach, in: 2018 IEEE/ACM 5th International Workshop on Requirements Engineering and Testing (RET), IEEE, 2018, pp. 33–40.
  • Aldekhail et al. [2016] M. Aldekhail, A. Chikh, D. Ziani, Software requirements conflict identification: review and recommendations, Int J Adv Comput Sci Appl (IJACSA) 7 (2016) 326.
  • Egyed and Grunbacher [2004] A. Egyed, P. Grunbacher, Identifying requirements conflicts and cooperation: How quality attributes and automated traceability can help, IEEE software 21 (2004) 50–58.
  • Merugu and Chinnam [2021] R. Merugu, S. R. Chinnam, Automated cloud service based quality requirement classification for software requirement specification, Evolutionary Intelligence 14 (2021) 389–394.
  • Zhao et al. [2021] L. Zhao, W. Alhoshan, A. Ferrari, K. J. Letsholo, M. A. Ajagbe, E.-V. Chioasca, R. T. Batista-Navarro, Natural language processing for requirements engineering: A systematic mapping study, ACM Computing Surveys (CSUR) 54 (2021) 1–41.
  • Ezzini et al. [2021] S. Ezzini, S. Abualhaija, C. Arora, M. Sabetzadeh, L. C. Briand, Using domain-specific corpora for improved handling of ambiguity in requirements, in: 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), IEEE, 2021, pp. 1485–1497.
  • Zhou et al. [2016] Y. Zhou, Y. Tong, R. Gu, H. Gall, Combining text mining and data mining for bug report classification, Journal of Software: Evolution and Process 28 (2016) 150–176.
  • Aggarwal et al. [2017] K. Aggarwal, F. Timbers, T. Rutgers, A. Hindle, E. Stroulia, R. Greiner, Detecting duplicate bug reports with software engineering domain knowledge, Journal of Software: Evolution and Process 29 (2017) e1821.
  • Guo et al. [2021] W. Guo, L. Zhang, X. Lian, Automatically detecting the conflicts between software requirements based on finer semantic analysis, arXiv preprint arXiv:2103.02255 (2021).
  • Diamantopoulos et al. [2017] T. Diamantopoulos, M. Roth, A. Symeonidis, E. Klein, Software requirements as an application domain for natural language processing, Language Resources and Evaluation 51 (2017) 495–524.
  • Sabriye and Zainon [2017] A. O. J. Sabriye, W. M. N. W. Zainon, A framework for detecting ambiguity in software requirement specification, in: 2017 8th International Conference on Information Technology (ICIT), IEEE, 2017, pp. 209–213.
  • Mairiza et al. [2009] D. Mairiza, D. Zowghi, N. Nurmuliani, Managing conflicts among non-functional requirements, in: Australian Workshop on Requirements Engineering, University of Technology, Sydney, 2009.
  • Kim et al. [2007] M. Kim, S. Park, V. Sugumaran, H. Yang, Managing requirements conflicts in software product lines: A goal and scenario based approach, Data & Knowledge Engineering 61 (2007) 417–432.
  • Butt et al. [2011] W. H. Butt, S. Amjad, F. Azam, Requirement conflicts resolution: using requirement filtering and analysis, in: International Conference on Computational Science and Its Applications, Springer, 2011, pp. 383–397.
  • Moser et al. [2011] T. Moser, D. Winkler, M. Heindl, S. Biffl, Requirements management with semantic technology: An empirical study on automated requirements categorization and conflict analysis, in: International Conference on Advanced Information Systems Engineering, Springer, 2011, pp. 3–17.
  • Aldekhail and Almasri [2022] M. Aldekhail, M. Almasri, Intelligent identification and resolution of software requirement conflicts: Assessment and evaluation., Computer Systems Science & Engineering 40 (2022).
  • Viana et al. [2017] T. Viana, A. Zisman, A. K. Bandara, Identifying conflicting requirements in systems of systems, in: 2017 IEEE 25th International Requirements Engineering Conference (RE), IEEE, 2017, pp. 436–441.
  • Urbieta et al. [2012] M. Urbieta, M. J. Escalona, E. Robles Luna, G. Rossi, Detecting conflicts and inconsistencies in web application requirements, in: Current Trends in Web Engineering: Workshops, Doctoral Symposium, and Tutorials, Held at ICWE 2011, Paphos, Cyprus, June 20-21, 2011. Revised Selected Papers 11, Springer, 2012, pp. 278–288.
  • Moser et al. [2011] T. Moser, D. Winkler, M. Heindl, S. Biffl, Automating the detection of complex semantic conflicts between software requirements, in: The 23rd International Conference on Software Engineering and Knowledge Engineering, Miami, 2011.
  • Kamalrudin et al. [2010] M. Kamalrudin, J. Grundy, J. Hosking, Managing consistency between textual requirements, abstract interactions and essential use cases, in: 2010 IEEE 34th Annual Computer Software and Applications Conference, IEEE, 2010, pp. 327–336.
  • Felty and Namjoshi [2003] A. P. Felty, K. S. Namjoshi, Feature specification and automated conflict detection, ACM Transactions on Software Engineering and Methodology (TOSEM) 12 (2003) 3–27.
  • Heisel and Souquieres [2001] M. Heisel, J. Souquieres, A heuristic algorithm to detect feature interactions in requirements, in: Language Constructs for Describing Features: Proceedings of the FIREworks workshop, Springer, 2001, pp. 143–162.
  • Manning et al. [2014] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, D. McClosky, The stanford corenlp natural language processing toolkit, in: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, stanford, 2014, pp. 55–60.
  • Shah et al. [2021] U. Shah, S. Patel, D. C. Jinwala, Detecting intra-conflicts in non-functional requirements, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 29 (2021) 435–461.
  • Abeba and Alemneh [2021] G. Abeba, E. Alemneh, Identification of nonfunctional requirement conflicts: Machine learning approach, in: International Conference on Advances of Science and Technology, Springer, 2021, pp. 435–445.
  • Chentouf [2014] Z. Chentouf, Managing oam&p requirement conflicts, Journal of King Saud University-Computer and Information Sciences 26 (2014) 296–307.
  • Mairiza et al. [2013] D. Mairiza, D. Zowghi, V. Gervasi, Conflict characterization and analysis of non functional requirements: An experimental approach, in: 2013 IEEE 12th International Conference on Intelligent Software Methodologies, Tools and Techniques (SoMeT), IEEE, 2013, pp. 83–91.
  • Sadana and Liu [2007] V. Sadana, X. F. Liu, Analysis of conflicts among non-functional requirements using integrated analysis of functional and non-functional requirements, in: 31st annual international computer software and applications conference (COMPSAC 2007), volume 1, IEEE, 2007, pp. 215–218.
  • Das et al. [2021] S. Das, N. Deb, A. Cortesi, N. Chaki, Sentence embedding models for similarity detection of software requirements, SN Computer Science 2 (2021) 1–11.
  • Cleland-Huang et al. [2018] J. Cleland-Huang, M. Vierhauser, S. Bayley, Dronology: An incubator for cyber-physical system research, arXiv preprint arXiv:1804.02423 (2018).
  • Mavin et al. [2009] A. Mavin, P. Wilkinson, A. Harwood, M. Novak, Easy approach to requirements syntax (ears), in: 2009 17th IEEE International Requirements Engineering Conference, IEEE, 2009, pp. 317–322.
  • Ferrari et al. [2017] A. Ferrari, G. O. Spagnolo, S. Gnesi, Pure: A dataset of public requirements documents, in: 2017 IEEE 25th International Requirements Engineering Conference (RE), IEEE, 2017, pp. 502–505.
  • Haskins et al. [2006] C. Haskins, K. Forsberg, M. Krueger, D. Walden, D. Hamelin, Systems engineering handbook, in: INCOSE, volume 9, International Council on Systems Engineering Seattle, 2006, pp. 13–16.
  • Aizawa [2003] A. Aizawa, An information-theoretic perspective of tf–idf measures, Information Processing & Management 39 (2003) 45–65.
  • Cer et al. [2018] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Céspedes, S. Yuan, C. Tar, et al., Universal sentence encoder, arXiv preprint arXiv:1803.11175 (2018).
  • Reimers and Gurevych [2019] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, arXiv preprint arXiv:1908.10084 (2019).
  • McInnes et al. [2018] L. McInnes, J. Healy, J. Melville, Umap: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
  • Gao and Wu [2018] X. Gao, S. Wu, Hierarchical clustering algorithm for binary data based on cosine similarity, in: 2018 8th International Conference on Logistics, Informatics and Service Sciences (LISS), IEEE, 2018, pp. 1–6.
  • Rupp et al. [2009] C. Rupp, M. Simon, F. Hocker, Requirements engineering und management, HMD Praxis der Wirtschaftsinformatik 46 (2009) 94–103.
  • Malik et al. [2022] G. Malik, M. Cevik, S. Bera, S. Yildirim, D. Parikh, A. Basar, Software requirement-specific entity extraction using transformer models, in: The 35th Canadian Conference on Artificial Intelligence, 2022.

7 Appendix

7.1 Overlapping Entity Ratio vs. Cosine Similarity

This section presents additional scatter plots related to Subsection 3.3.1.

Figure 5: Scatterplot depicting the relationship between Cosine Similarity and Overlapping Entity Ratio among UAV requirements. Panels: (a) POS Tagging, (b) SNER.
Figure 6: Scatterplot depicting the relationship between Cosine Similarity and Overlapping Entity Ratio among PURE requirements. Panels: (a) POS Tagging, (b) SNER.
Figure 7: Scatterplot depicting the relationship between Cosine Similarity and Overlapping Entity Ratio among IBM-UAV requirements. Panels: (a) POS Tagging, (b) SNER.

7.2 ROC Curves for Threshold Detection

Figures 8, 9, 10, and 11 show the ROC curves obtained from the 3-fold cross-validation over all the requirement sets. These curves facilitate the process of finding the cosine similarity threshold in Phase I.
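As an illustration of how a cosine similarity threshold might be read off a ROC curve, the sketch below sweeps candidate thresholds and keeps the one that maximizes Youden's J statistic (TPR minus FPR). The selection criterion is an assumption made for this example, since this section does not state the rule used in Phase I, and the scores and labels are hypothetical toy values.

```python
from typing import List, Sequence, Tuple


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[Tuple[float, float, float]]:
    """Return (threshold, FPR, TPR) for each distinct score used as a cut-off.

    A requirement is predicted as a conflict candidate when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("ROC needs at least one positive and one negative label")
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((threshold, fp / negatives, tp / positives))
    return points


def best_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the threshold maximizing Youden's J = TPR - FPR (assumed criterion)."""
    threshold, _, _ = max(roc_points(scores, labels), key=lambda p: p[2] - p[1])
    return threshold


# Hypothetical toy data: cosine similarity of each requirement to its nearest
# neighbour, and whether the requirement is labeled as a conflict (1) or not (0).
TOY_SCORES = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
TOY_LABELS = [1, 1, 1, 0, 1, 0]

if __name__ == "__main__":
    print("selected threshold:", best_threshold(TOY_SCORES, TOY_LABELS))
```

In a cross-validated setting, one such threshold would be obtained per training fold; how the per-fold values are combined is left open here.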

Figure 8: ROC curves for OpenCoss requirement set with TFIDF embedding across 3 folds. Panels: (a) Fold 1, (b) Fold 2, (c) Fold 3.
Figure 9: ROC curves for WorldVista requirement set with SBERT-TFIDF embedding across 3 folds. Panels: (a) Fold 1, (b) Fold 2, (c) Fold 3.
Figure 10: ROC curves for PURE requirement set with SBERT embedding across 3 folds. Panels: (a) Fold 1, (b) Fold 2, (c) Fold 3.
Figure 11: ROC curves for IBM-UAV requirement set with SBERT embedding across 3 folds. Panels: (a) Fold 1, (b) Fold 2, (c) Fold 3.