Ensuring Responsible Sourcing of Large Language Model Training Data Through Knowledge Graph Comparison

Devam Mondal
School of Systems and Enterprises, SIT
Hoboken, United States
[email protected]

Carlo Lipizzi
School of Systems and Enterprises, SIT
Hoboken, United States
[email protected]
Abstract

In light of recent plagiarism allegations brought by publishers, newspapers, and other creators of copyrighted corpora against large language model (LLM) developers, we propose a novel system, a variant of a plagiarism detection system, that assesses whether a knowledge source has been used in the training or fine-tuning of an LLM. Unlike current methods, our approach uses Resource Description Framework (RDF) triples to create knowledge graphs from both a source document and an LLM continuation of that document. These graphs are then compared with respect to content using cosine similarity and with respect to structure using a normalized version of graph edit distance that indicates the degree of isomorphism. Unlike traditional systems that focus on content matching and keyword identification between a source and target corpus, our approach enables a broader and thus more accurate evaluation of the similarity between a source document and an LLM continuation by focusing on relationships between ideas and their organization with regard to others. Additionally, our approach requires access neither to the training corpus nor to LLM metrics like perplexity, which may be unavailable in closed, “black-box” large language modeling systems. A prototype of our system will be found on a hyperlinked GitHub repository.

1 Introduction

Large language models (LLMs) have demonstrated an impressive ability to capture and replicate human language, generating large volumes of text in response to human-entered prompts. However, this capability exists due to the vast corpora of text these models are trained on. With models exceeding millions of parameters and continuing to grow, they require large volumes of information for training purposes. Sourcing this information, however, presents legal issues due to concerns over the copyright of such corpora. Many argue that copyrighted materials acquired through web-scraping cannot be used in training large language models because such information may be regurgitated when producing a result Chang et al. (2023). Others argue that the usage of copyrighted materials in training is acceptable because the results the large language model produces are derivative works of the source information.

Recently, allegations by The New York Times against OpenAI over the use of its copyrighted content in training ChatGPT have brought greater relevance to this issue Guo et al. (2024). This paper aims to address the issue by providing a mechanism that assesses whether a piece of content has been used to train/fine-tune a large language model.

Our approach considers the source document (the knowledge base one wishes to see if an LLM has been trained/fine-tuned on), and an LLM continuation of that document when provided the first sentence. We then create RDF (Resource Description Framework) triples in a [subject, predicate, object] format for the source document and the continuation, then make respective knowledge graphs. Then, the cosine similarity of each one-edge walk (a graphical interpretation corresponding to an RDF triple, where the start vertex is the subject, the edge is the predicate, and the end vertex is the object, as demonstrated in Figure 2) of the continuation knowledge graph is compared to each one-edge walk (an RDF triple) of the source knowledge graph. However, since both the source knowledge graph and continuation knowledge graph share their first RDF triple and thus one-edge walk (as they both have the same first sentence), we ignore this triple from the continuation knowledge graph during the comparison. Based on a threshold, if the cumulative cosine similarity is high, there is strong evidence that the source document was used in the training/fine-tuning of the LLM. We also consider the structure of the two graphs by considering their degree of isomorphism through a normalized graph edit distance metric that considers how many alterations must be made to transform the source knowledge graph into the continuation knowledge graph.

Our method addresses limitations with closed, “black box” large language modeling systems (systems like ChatGPT where metrics of the LLM, as well as training data, are unavailable due to abstraction by the developers) by only considering the outputs of an LLM. Our method also addresses limitations that exist with traditional plagiarism systems (systems that look for similarity between corpora) that utilize direct keyword and content matching by also considering broad relationships between ideas (done so through the aforementioned one-edge walk comparisons) and their presentation/organization with regard to others (done so through assessing the structure of the graph via the degree of isomorphism measured through normalized graph edit distance).

2 Literature Review

There exists a large range of literature that addresses identifying large language model training data in a “black-box” environment where the training corpus is unknown. For instance, LLM training data sourcing has been assessed through min-k% prob, which is a detection method based on the assumption that a member of the training data is less likely to include words that have high negative log-likelihood (and are thus outlier words) compared to a non-member of the training data Shi et al. (2024), therefore considering "anomalous" vocabulary within a text.
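To make the min-k% prob idea concrete, the following is a minimal sketch that assumes per-token log-probabilities are already available from the model (which, as noted in this paper, is often not the case for closed systems). It averages the log-probabilities of the fraction `k` of tokens the model found least likely; a higher (less negative) score is read as weaker evidence of outlier words and therefore stronger evidence of membership. The function name, the default `k`, and the sample values are illustrative assumptions, not taken from Shi et al. (2024).

```python
import math


def min_k_percent_prob(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens."""
    if not token_logprobs:
        raise ValueError("token_logprobs must not be empty")
    # Keep at least one token so the average is always defined.
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n


# Hypothetical log-probabilities, for illustration only.
member_like = [-0.1, -0.3, -0.2, -0.5, -0.4]      # no strong outlier tokens
non_member_like = [-0.1, -0.3, -6.0, -0.5, -4.0]  # two very unlikely tokens

score_member = min_k_percent_prob(member_like, k=0.4)
score_non_member = min_k_percent_prob(non_member_like, k=0.4)
```

In practice the score would be compared against a threshold chosen on held-out data; that calibration step is outside this sketch.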

Such an approach, based on the principles of Membership Inference Attacks (MIAs), an adversarial technique that seeks to determine whether a knowledge source is part of a model’s training data, is the most common method to identify LLM training data in “black-box” environments. Substantial literature also exists about utilizing MIA principles to identify corpora used to fine-tune LLMs, addressing word embeddings Mahloujifar et al. (2021), addressing NLP classification models for members of training corpora Shejwalkar et al. (2021), and addressing source text memorization Song and Shmatikov (2019).

However, such approaches based on MIA principles take a statistical and probabilistic approach to identifying LLM training data, ignoring other “signs” of sourcing that extend beyond simple copying or paraphrasing. Statistical measures that only consider the likelihood of "anomalous" words ignore the broad relationships between ideas that exist in the sentences of a source corpus and that may manifest themselves in an LLM’s generated answer.

Additionally, traditional plagiarism detection systems (systems that compare the similarity of multiple corpora) often rely on simple matching techniques. For instance, these systems may search for direct token (word, sentence, unique phrase, paragraph, etc.) matches between a document and others, using a threshold for matches as an indicator of plagiarism/similarity. Other systems narrow down to the individual word/phrase level, analyzing semantic relationships through simple synonym/antonym detection or more complex Semantic Role Labeling techniques between words in target and source sentences Osman et al. (2012). Other systems use character-based n-gram analysis to detect similarity between a target and source corpus Bensalem et al. (2014). However, such systems fail to look at similarities in broad idea organization and content structure between a source and target text. They thus do not account for the fact that plagiarism can occur at a high level, in the ideas and the way they are organized in a text, in addition to singular sentence/phrase/word copying. Furthermore, some plagiarism systems utilize Deep Neural Networks in order to assess the similarity between a target and source corpus El-Rashidy et al. (2022), Hambi and Benabbou (2020). Such systems offer advantages such as the ability to be used with a variety of corpora due to transfer learning concepts such as layer freezing that enable fine-tuning. Furthermore, these systems are able to learn more semantically complex features from the given corpora due to the presence of multiple deep layers. However, due to the black-box nature of such an approach, it is hard to trace which aspects of the target and source match most closely.

Therefore, we hope to augment work in this field by considering knowledge graphs and their ability to model relationships in a transparent way, creating a variant of a plagiarism detection system that can indicate whether a document was used in the training/fine-tuning of an LLM by comparing the similarity between broad ideas present in a source document and an LLM continuation, as well as their organization. Knowledge graphs address the aforementioned limitations with a focus on ideas that extend beyond simple semantic comparison while eliminating a black-box approach to plagiarism detection through easy visualization.

3 Approach

3.1 Establishing the Ground Truth

The first part of the system is a source document, the corpus that the LLM is suspected to be trained/fine-tuned on. This is the ground truth for the system and serves as a base for all measurements. To convert this document to a knowledge graph, we first extract RDF triples from the corpus because of their ability to capture complex relationships that encompass the main idea(s) of a sentence. These RDF triples are organized in [subject, predicate, object] format. A knowledge graph $G_S$ can then be produced from a series of RDF triples by looking for a common subject, establishing that as a main node, and then creating edges from that node that correspond to predicates, and then creating additional end nodes that are the objects.

For example, suppose sentence $S$

“A planning process is critical for organizations, assists individuals and benefits society”

is part of the source document. Using a ChatGPT prompt (found in Appendix A) to extract RDF triples produces the following list:

["planning", "is critical for", "organizations"], ["planning", "assists", "individuals"], ["planning", "benefits", "society"]

From here, by establishing the subject “planning” as the central node of the knowledge graph, each edge maps to each unique predicate (“is critical for”, “assists”, “benefits”). Each one-edge walk from the central node thus leads to new end nodes that correspond to each unique object (“organizations”, “individuals”, “society”) respectively. Figure 1 demonstrates this process.
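The construction above can be sketched in a few lines of Python. This is a minimal illustration that assumes the triples have already been extracted; the adjacency-list representation and the helper names `build_knowledge_graph` and `one_edge_walks` are choices made for this sketch rather than part of the prototype.

```python
from collections import defaultdict


def build_knowledge_graph(triples):
    """Map each subject node to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        edge = (predicate, obj)
        if edge not in graph[subject]:  # skip exact duplicate triples
            graph[subject].append(edge)
    return dict(graph)


def one_edge_walks(graph):
    """List every (subject, predicate, object) walk of length one."""
    return [(s, p, o) for s, edges in graph.items() for p, o in edges]


triples = [
    ["planning", "is critical for", "organizations"],
    ["planning", "assists", "individuals"],
    ["planning", "benefits", "society"],
]
graph = build_knowledge_graph(triples)
walks = one_edge_walks(graph)
```

Here `graph` has the single central node "planning" with three outgoing edges, and `walks` recovers the three original triples.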

Figure 1: Ground truth knowledge graph generation from a source document using RDF triples.

3.2 Exposing LLM Understanding of the Ground Truth Using a Continuation

The second part of this system is obtaining the LLM continuation and generating the continuation knowledge graph. Rather than “quiz” the LLM over content in the source document, we choose to evaluate the continuation to expose the LLM’s thought process with regard to the organization, phrasing, and selection of ideas when given the first sentence of the source document. Furthermore, we believe that generating a continuation is a far more challenging and more insightful test of understanding compared to simple fact retrieval. For instance, if the continuation contains a unique higher-order thought derived through synthesis that is present in the source document, there is a significant chance that the model was trained/fine-tuned on the source document. Simple quizzing would not enable the study of such "emergent properties," ignoring synthesized thoughts LLMs produce through a focus on simple knowledge gathering.

Given source document $S$ containing sentences $S_0, S_1, S_2, \ldots \in S$, we provide the LLM the first sentence of the source document ($S_0$) and ask it for a continuation that is $|S| - 1$ sentences long using a ChatGPT prompt (found in Appendix B). The first sentence is then concatenated to the continuation, producing a complete continuation $C$, and the process detailed in Section 3.1 is applied to $C$ to produce knowledge graph $G_C$.
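As a rough sketch of this step, the code below takes the first sentence $S_0$, requests $|S| - 1$ further sentences, and concatenates the two to form $C$. The `generate` argument is a hypothetical callable standing in for the LLM call with the Appendix B prompt, the regular-expression sentence splitter is a naive assumption, and `stub_llm` exists only so the example runs without a model.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def complete_continuation(source_text, generate):
    """Return S_0 concatenated with a continuation of |S| - 1 sentences."""
    sentences = split_sentences(source_text)
    if not sentences:
        raise ValueError("source document has no sentences")
    first = sentences[0]
    continuation = generate(first, len(sentences) - 1)
    return f"{first} {continuation}".strip()


def stub_llm(first_sentence, n_sentences):
    """Placeholder for a real LLM call; returns filler text."""
    return " ".join(["Placeholder sentence."] * n_sentences)


source = (
    "Supply chains enable individuals to be more efficient. "
    "Supply chains fuel business decisions."
)
complete = complete_continuation(source, stub_llm)
```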

Now, if the LLM continuation has a similar organization, flow of ideas, as well as choice of ideas in an RDF format as the original document, there is a strong likelihood the document was used in training/fine-tuning the LLM. We assess the degree of similarity in flow and choice of ideas through content comparison of $G_S$ and $G_C$ using cosine similarity. We assess the degree of similarity in organization through the normalized graph edit distance metric that assesses the degree of isomorphism.

Figure 2: One-edge walks of a knowledge graph.

3.3 Assessing Training Usage Through Content

Once $G_S$ and $G_C$ are created, we consider all one-edge walks of each knowledge graph. Figure 2 shows these one-edge walks in a sample knowledge graph. These walks correspond to the RDF triples created from the source document and the continuation. Both sets of walks, whose vertices $V$ are {subject, object} and edge $E$ is just {predicate}, are then vectorized to embeddings. The embeddings for $G_S$ are then stored in a vector database.

From here, we ignore the vectorized walk corresponding to the first sentence of the source document in $G_C$ (as both the source document and continuation share this sentence). We instead take all the other vectorized walks from $G_C$ and find the top three matches with all vectorized walks from $G_S$ with respect to cosine similarity. We choose cosine similarity over other similarity metrics (Manhattan, Euclidean, etc.) because it is bounded and is consistently used in other literature when dealing with word embeddings. The similarities for these three matches that are above a user-defined threshold are then added to provide the total similarity $W_{\cos\theta}$ for that vectorized walk. Formally, if $W_{G_C}$ is the vectorized walk from $G_C$, and $m_1$, $m_2$, and $m_3$ are the top three matches from $G_S$ (vectorized walks with greatest cosine similarity to $W_{G_C}$), under the assumption that all three similarities are over the user-defined threshold, the total similarity ($W_{\cos\theta}$) is:

$$\frac{W_{G_C} \cdot m_1}{|W_{G_C}||m_1|} + \frac{W_{G_C} \cdot m_2}{|W_{G_C}||m_2|} + \frac{W_{G_C} \cdot m_3}{|W_{G_C}||m_3|} \tag{1}$$

Note that there might be situations when an incompatible vectorized walk from $G_C$ only produces two, one, or zero matches. In these situations, Equation 1 reduces to two, one, or zero terms respectively.
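Equation 1, including the cases where fewer than three matches survive the threshold, can be sketched as follows. This is an illustration over plain Python lists; the prototype stores the $G_S$ embeddings in a vector database, which would replace the brute-force ranking used here. The names `cosine_similarity` and `walk_total_similarity` and the two-dimensional sample vectors are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def walk_total_similarity(walk_vec, source_vecs, min_cos, top_k=3):
    """Sum the top_k cosine similarities that exceed min_cos (Equation 1)."""
    sims = sorted(
        (cosine_similarity(walk_vec, m) for m in source_vecs), reverse=True
    )
    # Matches at or below the threshold contribute no term at all.
    return sum(s for s in sims[:top_k] if s > min_cos)


# Toy two-dimensional vectors, for illustration only.
source_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
two_terms = walk_total_similarity([1.0, 0.0], source_vecs, min_cos=0.5)
one_term = walk_total_similarity([1.0, 0.0], source_vecs, min_cos=0.9)
no_terms = walk_total_similarity([0.0, 1.0], [[1.0, 0.0]], min_cos=0.5)
```

With a threshold of 0.5 two of the three candidate matches survive, with 0.9 only the exact match does, and an orthogonal walk yields a total similarity of zero.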

This process is repeated for all vectorized walks in $G_C$, with every walk’s total similarity adding to the graph’s cumulative similarity, which is then divided by the total number of walks in $G_C$ to yield the total average similarity $G_{C_{\cos\theta}}$.

From here, we set a threshold for $G_{C_{\cos\theta}}$ based on two factors.

  1. The minimum required cosine similarity between $W_{G_C}$ and $m_1$, $m_2$, and $m_3$ for all three $W_{G_S}$ walks for them to be considered matches, defined as $min_{\cos\theta}$. This threshold was mentioned earlier in the presentation of Equation 1.

  2. The total number of vectorized walks from $G_S$ that should match with vectorized walks from $G_C$, defined as $m_t$.

We chose these two factors because they account for sentence-specific similarity (demonstrated by $min_{\cos\theta}$) as well as broader topic-wise similarity (demonstrated by $m_t$) between the source document and the continuation. These two factors are up to the user and are arbitrary values based on the use case.

We therefore define the threshold for similarity $sim_{G_S,G_C}$ as $min_{\cos\theta} \times m_t$. If $G_{C_{\cos\theta}} > sim_{G_S,G_C}$, there is strong evidence that the source document has been used to train/fine-tune the model.
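The averaging and threshold test can be sketched as below, assuming each walk's total similarity has already been computed. The text leaves open whether the divisor counts the ignored first-sentence walk, so the optional `n_walks` argument is an assumption that lets the caller choose; all function names and numeric values are hypothetical.

```python
def average_similarity(walk_totals, n_walks=None):
    """Cumulative similarity divided by the number of walks in G_C."""
    n = len(walk_totals) if n_walks is None else n_walks
    if n == 0:
        return 0.0
    return sum(walk_totals) / n


def similarity_threshold(min_cos, m_t):
    """sim_{G_S,G_C} = min_cos * m_t, both chosen by the user."""
    return min_cos * m_t


def suggests_training_use(walk_totals, min_cos, m_t, n_walks=None):
    """True when the total average similarity exceeds the threshold."""
    average = average_similarity(walk_totals, n_walks)
    return average > similarity_threshold(min_cos, m_t)


# Hypothetical per-walk totals, for illustration only.
walk_totals = [2.4, 1.8, 0.0]
lenient = suggests_training_use(walk_totals, min_cos=0.6, m_t=2)
strict = suggests_training_use(walk_totals, min_cos=0.6, m_t=3)
```

For these values the average is about 1.4, which clears the threshold of 1.2 for `m_t=2` but not the threshold of 1.8 for `m_t=3`.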

3.4 Assessing Training Usage Through Content: Nonsensical Example

To provide an example of this component of the system, we highlight a situation where there is an incompatible vectorized walk from $G_C$. Consider the following source document $S$:

“Supply chains enable individuals to be more efficient. Supply chains fuel business decisions.”

From here, $G_S$ consists of the following RDF triples and thus one-edge walks:

["supply chains", "enable", "efficient individuals"], ["supply chains", "fuel", "business decisions"]

Now, consider the following continuation, generated by an LLM with the first sentence of the source document concatenated at the beginning:

“Supply chains enable individuals to be more efficient. Supply chains play soccer.”

The RDF triples generated for the LLM continuation that make up the one-edge walks for $G_C$ are the following:

["supply chains", "enable", "efficient individuals"], ["supply chains", "play", "soccer"]

When assessing similarity, we only take the second RDF triple from $G_C$ (as both $G_S$ and $G_C$ share the first triple). Because there exist only two RDF triples/one-edge walks for $G_S$, there are at most two top matches. However, because the second RDF triple from $G_C$ is nonsensical, there exist zero top matches given a set of reasonably high thresholds. Therefore, the total similarity $W_{\cos\theta}$ for this walk is 0, and the graph’s cumulative similarity is 0. This result makes sense, given the completely nonsensical continuation generated. Figure 3 provides a visual of this comparison process.
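The example can be traced in code with a deliberately crude stand-in for the embedding step. The sketch below represents each walk as a bag-of-words count vector, which is an assumption made only so the example runs without an embedding model; the similarity values a real embedding model would produce will differ. The threshold of 0.8 is likewise hypothetical.

```python
import math
from collections import Counter


def embed(triple):
    """Stand-in embedding: word counts over the joined triple text."""
    return Counter(" ".join(triple).lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


source_walks = [
    ("supply chains", "enable", "efficient individuals"),
    ("supply chains", "fuel", "business decisions"),
]
continuation_walk = ("supply chains", "play", "soccer")
min_cos = 0.8  # hypothetical user-defined threshold

sims = sorted(
    (cosine(embed(continuation_walk), embed(w)) for w in source_walks),
    reverse=True,
)
matches = [s for s in sims[:3] if s > min_cos]
total_similarity = sum(matches)
```

Under this stand-in, the shared subject "supply chains" still yields a cosine similarity of roughly 0.45 against each source walk, but neither value clears the threshold, so `matches` is empty and the walk's total similarity is 0.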

Figure 3: A visual of the comparison process in assessing similarity through content.

If the second RDF triple from $G_C$ were instead [“supply chains”, “provide”, “resources”], there would exist two top matches given its relative increased relevance to all the RDF triples/one-edge walks for $G_S$. Therefore, there would be positive total similarity and cumulative similarity. If this cumulative $G_{C_{\cos\theta}}$ is above $sim_{G_S,G_C}$, calculated based on $min_{\cos\theta}$ and $m_t$, it is likely that the source document has been used to train/fine-tune the model.

3.5 Assessing Training Usage Through Structure

In addition to assessing content similarity between the source document knowledge graph and continuation knowledge graph through cosine similarity, we propose taking into consideration graph structure by considering the two graphs’ degree of isomorphism. This is a measure of structural and organizational similarity between the source and the continuation.

Two graphs can be considered isomorphic if there exists an edge-preserving bijection that enables the one-to-one mapping of two sets of vertices based on equivalent labels. We therefore measure the degree of isomorphism by considering graph edit distance. Graph edit distance is the minimum number of graph edit operations (insertion of nodes, merging/splitting of nodes, edge contraction, etc.) needed to morph one graph into another. As finding graph edit distance is an NP-hard problem, we rely on an efficient heuristic algorithm when calculating graph edit distance.

However, because the source document knowledge graph and continuation knowledge graph may not have the same number of edges and vertices in all use cases based on the length of the source document, we propose a relative graph edit distance where the graph edit distance between the source document graph and continuation graph is divided by the sum of the graph edit distance between the source document graph and a null graph ($K_0$) and the graph edit distance between the continuation document graph and a null graph. This is because $n$ edits necessary to morph one graph into another when both graphs have a small number of edges and vertices suggest a greater structural difference than the same $n$ edits when the two graphs have significantly more edges and vertices.

Formally, $normGED(G_S, G_C)$ is defined as:

$$\frac{GED(G_S, G_C)}{GED(G_S, K_0) + GED(G_C, K_0)} \tag{2}$$

In addition to $G_{C_{\cos\theta}}$, the value of $normGED(G_S, G_C)$ must also be taken into account when assessing whether an LLM was trained/fine-tuned on a source document. Lower values of $normGED(G_S, G_C)$ (what counts as low is up to the user and the use case) suggest greater structural similarity between the source and continuation knowledge graphs, supporting the conclusion that the source document was used to train/fine-tune the LLM.
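Equation 2 can be illustrated for very small graphs with an exact brute-force search. The sketch below is not the heuristic referred to above: it assumes only node and edge insertions and deletions at unit cost, ignores all labels (structure only), allows at most one directed edge per ordered node pair, and tries every node mapping, so it is practical only for a handful of nodes. The function names and sample graphs are illustrative.

```python
from itertools import permutations


def graph_edit_distance(nodes_a, edges_a, nodes_b, edges_b):
    """Exact structure-only GED by trying every node mapping (tiny graphs)."""
    a, b = list(nodes_a), list(nodes_b)
    node_cost = abs(len(a) - len(b))  # unmatched nodes are inserted/deleted
    size = max(len(a), len(b))
    # Pad the smaller node list with placeholders that carry no edges.
    a += [("pad_a", i) for i in range(size - len(a))]
    b += [("pad_b", i) for i in range(size - len(b))]
    target_edges = set(edges_b)
    best = None
    for perm in permutations(b):
        mapping = dict(zip(a, perm))
        mapped = {(mapping[u], mapping[v]) for u, v in edges_a}
        # Edges present on one side only must be deleted or inserted.
        cost = node_cost + len(mapped ^ target_edges)
        if best is None or cost < best:
            best = cost
    return best


def norm_ged(nodes_s, edges_s, nodes_c, edges_c):
    """Equation 2: GED(G_S, G_C) / (GED(G_S, K_0) + GED(G_C, K_0))."""
    denom = graph_edit_distance(nodes_s, edges_s, [], []) + graph_edit_distance(
        nodes_c, edges_c, [], []
    )
    if denom == 0:
        return 0.0  # both graphs are already the null graph
    return graph_edit_distance(nodes_s, edges_s, nodes_c, edges_c) / denom


# Source: central node with three one-edge walks; continuation: only two.
nodes_s = ["planning", "organizations", "individuals", "society"]
edges_s = {("planning", n) for n in nodes_s[1:]}
nodes_c = ["planning", "organizations", "individuals"]
edges_c = {("planning", n) for n in nodes_c[1:]}

same_structure = norm_ged(nodes_s, edges_s, nodes_s, edges_s)
one_walk_fewer = norm_ged(nodes_s, edges_s, nodes_c, edges_c)
```

Under these cost assumptions the distance to the null graph is simply the number of nodes plus the number of edges, so dropping one walk from a three-walk graph costs 2 edits against a denominator of 7 + 5 = 12.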

3.6 Assessing Training Usage Through Structure: Limitations

It is important to note the major limitation of the $normGED(G_S, G_C)$ metric in that it only considers the structure of two graphs. Thus, this metric cannot alone be used to compare whether $G_S$ and $G_C$ are similar enough to suggest that the source document was used in the training/fine-tuning of an LLM.

This is because $normGED(G_S, G_C)$ only considers structural similarity and neglects content (the semantic meaning of the vertices and edges). Its sole usage may thus be misleading. Consider graph $G_F$, made up of three RDF triples/one-edge walks with the central node "stocks," and graph $G_A$, made up of three RDF triples/one-edge walks with the central node "cars." Despite these knowledge graphs being very different when assessed using content, as they focus on two completely different topics, $normGED(G_F, G_A)$ will return 0 because these graphs have the exact same structure (as it requires 0 edit operations to transform one into the other). Thus, in the event that a continuation is completely nonsensical yet possesses the same structure and organization of ideas as the source document, analyzing their respective knowledge graphs only through $normGED(G_S, G_C)$ would provide misleading results regarding similarity.

Therefore, it is important to establish some type of weighted compound metric that takes into account both content ($G_{C_{\cos\theta}}$) and structure ($normGED(G_S, G_C)$).

4 Summary and Conclusions

In this paper, we provide a novel system of determining whether a source document was used in the training/fine-tuning of an LLM. Unlike current methods, we leverage knowledge graphs by converting the source document and an LLM continuation of the source document (when given the first sentence) to RDF triples, which are then used to generate a knowledge graph for both.

After the one-edge walks of both graphs are vectorized using embeddings, every one-edge walk (which corresponds to an RDF triple) of the continuation knowledge graph, other than the one corresponding to the first sentence (as it is common to both the continuation and the source document), is compared to the one-edge walks of the source document knowledge graph with respect to cosine similarity. The top three matches (greatest cosine similarities), if above a user-defined threshold, are added to give that walk’s total similarity. This process repeats for all one-edge walks of the continuation knowledge graph, with each walk’s total similarity contributing to the graph’s cumulative similarity, which, when divided by the number of one-edge walks, yields the total average similarity. When this value is above a user-defined threshold, there is evidence that the source document was used in the fine-tuning/training of the LLM.

In addition to this assessment of content similarity, we propose a new metric to assess structural similarity between the source document knowledge graph and the LLM continuation knowledge graph. It measures their degree of isomorphism using a relative version of graph edit distance.

In all, our work provides a framework to assess the sourcing of training data for LLMs and helps bring greater accountability for responsible sourcing of training corpora.

5 Future Work

In follow-up work, we plan to test our system and provide experimental data regarding its effectiveness. We would fine-tune an LLM on a fabricated source document not found on the Internet, then ask the LLM to provide a continuation for that document. We would then compare the fine-tuned LLM's results on the aforementioned metrics with those of a vanilla LLM.

Additionally, the thresholds mentioned in our work are left to the user and depend on the use case of the document and LLM. We believe finding definite values for various use cases would benefit users of our system and improve decision-making when considering "high" and "low" values for $G_{C_{\cos\theta}}$ and $normGED(G_S, G_C)$.

Furthermore, we have two independent metrics for assessing whether a source document was used in the training/fine-tuning of an LLM, one based on content ($G_{C_{\cos\theta}}$) and another based on structure ($normGED(G_S, G_C)$). As mentioned in Section 3.6, it would be beneficial to create a combined metric that produces a single value in which content metrics and degree of isomorphism are both considered. This metric must weigh the content and structure metrics appropriately, as two graphs with the same structure may have completely different meanings and are thus dissimilar. For instance, the continuation and source knowledge graphs may have similar organizational structures yet completely different content and meanings, which would suggest that the source document was not used to train/fine-tune the LLM.
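One possible shape for such a combined value is a convex combination, sketched below. This is only an illustration and not a metric defined in this paper: it assumes the content score has been rescaled to [0, 1], that the normalized graph edit distance also lies in [0, 1] with 0 meaning identical structure, and that `content_weight` is a hypothetical user-chosen parameter.

```python
def combined_similarity(content_sim, norm_ged, content_weight=0.7):
    """Hypothetical weighted combination of content and structure scores.

    content_sim: content similarity, assumed rescaled to [0, 1].
    norm_ged: normalized graph edit distance, assumed in [0, 1] (0 = same structure).
    content_weight: share of the final score given to content (assumption).
    """
    if not 0.0 <= content_weight <= 1.0:
        raise ValueError("content_weight must lie in [0, 1]")
    # Clamp inputs so out-of-range values cannot push the result outside [0, 1].
    content_sim = min(max(content_sim, 0.0), 1.0)
    norm_ged = min(max(norm_ged, 0.0), 1.0)
    structure_sim = 1.0 - norm_ged  # turn a distance into a similarity
    return content_weight * content_sim + (1.0 - content_weight) * structure_sim


# Same structure, unrelated content (the "stocks" vs. "cars" situation).
same_shape_different_topic = combined_similarity(content_sim=0.1, norm_ged=0.0)
# Same structure and closely matching content.
same_shape_same_topic = combined_similarity(content_sim=0.9, norm_ged=0.0)
```

Giving content the larger weight keeps a structurally identical but topically unrelated pair well below a pair that matches on both axes; a multiplicative form would penalize such pairs even more strongly and may be worth comparing.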

References

Appendix A Generating the RDF Triples

To generate the RDF triples, we utilize OpenAI’s ChatGPT 3.5 Turbo API and pass in an "assistant" and "user" message. Below is the base "assistant" message, which serves as the prompt:

Given a prompt, extrapolate as many relationships as possible from it and provide a list of updates. provide a json. If an update is a relationship, provide [ENTITY 1, RELATIONSHIP, ENTITY 2]. The relationship is directed, so the order matters. Make the relationship the most granular possible. Examples: prompt: Sun is source of light and heat. It is also source of Vitamin D. updates: [["Sun", "source of", "light"],["Sun", "source of", "heat"],["Sun","source of", "Vitamin D"]] prompt: A planning process is critical for organizations, individuals and society. updates: [["planning", "is critical for", "organizations"],["planning", "is critical for", "individuals"],["planning", "is critical for", "society"]]

In our prompt, we provided a few examples in order to enable better in-context learning. We then passed in the source document as part of the "user" message.
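Once the model replies, the list of updates has to be turned into triples. The sketch below shows one way to do this with Python's standard `json` module, assuming the reply is a bare JSON array of three-element string arrays as in the examples of the prompt. The helper name `parse_triples` and the sample reply are hypothetical, and real replies may need extra cleanup (for instance, stripping a leading "updates:" label).

```python
import json


def parse_triples(raw_updates):
    """Parse a JSON array of [entity 1, relationship, entity 2] lists.

    Entries that are not lists of exactly three strings are skipped.
    """
    data = json.loads(raw_updates)
    triples = []
    for item in data:
        if (isinstance(item, list) and len(item) == 3
                and all(isinstance(part, str) for part in item)):
            triples.append(tuple(part.strip() for part in item))
    return triples


# Hypothetical model reply, mirroring the first example in the prompt.
example_reply = (
    '[["Sun", "source of", "light"],'
    ' ["Sun", "source of", "heat"],'
    ' ["Sun", "source of", "Vitamin D"],'
    ' ["malformed entry"]]'
)
triples = parse_triples(example_reply)
```

Skipping malformed entries instead of raising keeps one bad item from discarding an otherwise usable reply; stricter validation may be preferable in other settings.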

Appendix B Generating the LLM Continuation

To generate the LLM continuation, we once again used OpenAI’s ChatGPT 3.5 Turbo API and passed in the following "user" message:

Based off of your training, generate a continuation for the following sentence firstLine. The continuation must EXACTLY be sentenceCount-1 sentences long.

We used a Python f-string to insert the first line into the "user" message in place of the firstLine variable, and to insert the number of sentences the continuation must contain (in place of the expression sentenceCount-1).
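A minimal sketch of this substitution follows. The exact placement of the placeholders and of the surrounding punctuation in the original template is an assumption, and the function name `build_continuation_prompt` is hypothetical.

```python
def build_continuation_prompt(first_line: str, sentence_count: int) -> str:
    """Fill the continuation template using an f-string.

    first_line: first sentence of the source document.
    sentence_count: number of sentences in the source document; the
        continuation is asked to be one sentence shorter.
    """
    if sentence_count < 2:
        raise ValueError("sentence_count must be at least 2")
    return (
        "Based off of your training, generate a continuation for the "
        f"following sentence {first_line}. The continuation must EXACTLY be "
        f"{sentence_count - 1} sentences long."
    )


# Hypothetical first line of a five-sentence source document.
prompt = build_continuation_prompt("Stocks represent ownership in a company", 5)
```

Requesting one sentence fewer than the source contains means the first line plus the continuation has the same sentence count as the source document.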

Appendix C Other Technical Details

We used the Python library NetworkX to create and visualize the knowledge graphs, as well as the OpenAI API to extract RDF triples. We used Pinecone to create our vector database.
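As an illustration of the graph-construction step, the sketch below builds a directed NetworkX graph from a list of triples, storing the relationship as an edge attribute. The helper name `build_knowledge_graph` and the attribute name `relation` are assumptions made for this example, and visualization and the vector database are omitted.

```python
import networkx as nx


def build_knowledge_graph(triples):
    """Build a directed graph with one edge per (subject, relation, object) triple."""
    graph = nx.DiGraph()
    for subject, relation, obj in triples:
        # The relation is kept as edge data so the triple can be recovered later.
        graph.add_edge(subject, obj, relation=relation)
    return graph


# Triples taken from the first example in the Appendix A prompt.
triples = [("Sun", "source of", "light"),
           ("Sun", "source of", "heat"),
           ("Sun", "source of", "Vitamin D")]
graph = build_knowledge_graph(triples)

# Each edge is a one-edge walk that maps back to an RDF triple.
one_edge_walks = [(u, data["relation"], v) for u, v, data in graph.edges(data=True)]
```

A `DiGraph` keeps at most one edge per ordered node pair, so a second relation between the same two entities would overwrite the first; `nx.MultiDiGraph` would be the alternative if parallel relations must be preserved.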