Croppable Knowledge Graph Embedding

Yushan Zhu (Zhejiang University), Wen Zhang (Zhejiang University), Zhiqiang Liu (Zhejiang University), Mingyang Chen (Zhejiang University), Lei Liang (Ant Group), Huajun Chen (Zhejiang University)

Abstract

Knowledge Graph Embedding (KGE) is a common method for Knowledge Graphs (KGs) to serve various artificial intelligence tasks. The suitable dimensions of the embeddings depend on the storage and computing conditions of the specific application scenarios. Once a new dimension is required, a new KGE model needs to be trained from scratch, which greatly increases the training cost and limits the efficiency and flexibility of KGE in serving various scenarios. In this work, we propose a novel KGE training framework MED, through which we can train once to get a croppable KGE model applicable to multiple scenarios with different dimensional requirements: sub-models of the required dimensions can be cropped out of it and used directly without any additional training. In MED, we propose a mutual learning mechanism to improve the performance of the low-dimensional sub-models and make the high-dimensional sub-models retain the capacity that low-dimensional sub-models have, an evolutionary improvement mechanism to promote the high-dimensional sub-models to master the knowledge that the low-dimensional sub-models cannot learn, and a dynamic loss weight to balance the multiple losses adaptively. Experiments on 3 KGE models over 4 standard KG completion datasets, 3 real application scenarios over a real-world large-scale KG, and the extension of MED to the language model BERT show the effectiveness, high efficiency, and flexible extensibility of MED.

1 Introduction

Knowledge Graphs (KGs) are composed of triples representing facts in the form of (head entity, relation, tail entity), abbreviated as (h, r, t). KGs have been widely used in recommendation systems DBLP:conf/mm/ZhuZZYCZC21 ; DBLP:conf/icde/ZhangWYWZC21 , information extraction DBLP:conf/acl/HoffmannZLZW11 ; DBLP:conf/i-semantics/DaiberJHM13 , question answering DBLP:journals/corr/ZhangLHJLW016 ; DBLP:conf/www/DiefenbachSM18 and other tasks. A common way to apply a knowledge graph is to represent its entities and relations in a continuous vector space, called knowledge graph embedding (KGE) DBLP:conf/nips/BordesUGWY13 ; DBLP:conf/iclr/SunDNT19 , and then use the vector representations of entities and relations to serve a variety of tasks.

KGEs with higher dimensions have greater expressive power and usually achieve better performance, but this also means a larger number of parameters and requires more storage space and computing resources DBLP:conf/wsdm/ZhuZCCC0C22 ; DBLP:conf/acl/Sachan20 . The appropriate dimensions of the KGE are different for different devices or scenarios. As shown in Fig. 1, large remote servers have large storage space and sufficient computing resources to support high-dimensional KGE with good performance, while small and medium-sized terminal devices, such as vehicle-mounted systems or smartphones, can only accept low-dimensional KGE due to limited computing power and storage capacity. Therefore, according to the conditions of different devices or scenes, people tend to train the KGE with appropriate dimensions and as high quality as possible. However, the challenge is that once a new dimension is required, a new KGE needs to be trained from scratch. Especially when only low-dimensional KGE can be applied, to ensure good performance, the additional model compression technology such as knowledge distillation DBLP:journals/corr/HintonVD15 ; DBLP:conf/wsdm/ZhuZCCC0C22 is needed during training. This significantly increases training costs and limits KGE’s efficiency and flexibility in serving different scenarios.

Figure 1: Diverse KGE dimensions for a KG.

We therefore propose a new concept, "croppable KGE", and study the following research question: is it possible to train a croppable KGE from which KGEs of various required dimensions can be cropped out, used directly without any additional training, and still achieve promising performance?

In this work, our main idea of croppable KGE learning is to train an entire KGE that contains many sub-models of different dimensions in it. These sub-models share their embedding parameters and are trained simultaneously. The goal is that the low-dimensional sub-models can benefit from the more expressive high-dimensional sub-models, while the high-dimensional sub-models retain the ability of the low-dimensional sub-models and master the knowledge that the low-dimensional sub-models cannot. Based on this idea, we propose a croppable KGE training framework MED, which consists of three main modules, the Mutual learning mechanism, the Evolutionary improvement mechanism, and the Dynamic loss weight to achieve the above purpose. Specifically, the mutual learning mechanism is based on knowledge distillation and it makes pairwise neighbor sub-models learn from each other, so that the performance of the lower-dimensional sub-model can be improved, and the higher-dimensional sub-model can retain the ability of the lower-dimensional sub-model. The evolutionary improvement mechanism helps the high-dimensional sub-model master more knowledge that the low-dimensional sub-model cannot by making the high-dimensional sub-model pay more attention to learn the triples that the low-dimensional sub-model can’t correctly predict. The dynamic loss weight is designed to adaptively balance multiple losses of different sub-models according to their dimensions and further improve the overall performance.

We evaluate the effectiveness of our proposed MED by implementing it on three typical KGE methods and four standard KG datasets. We also demonstrate its practical value by applying MED to a real-world large-scale KG and downstream tasks. Furthermore, we demonstrate the extensibility of MED by implementing it on the language model BERT DBLP:conf/naacl/DevlinCLT19 and the GLUE DBLP:conf/iclr/WangSMHLB19 benchmarks. The experimental results show that (1) MED successfully trains a croppable KGE model available for various dimensional requirements, which contains multiple parameter-shared sub-models of different dimensions that achieve high performance and can be used directly without additional training; (2) the training efficiency of MED is far higher than that of independently training multiple KGE models of different sizes or obtaining them by knowledge distillation; (3) MED can be flexibly extended to other neural network models besides KGE and achieve good performance; (4) our proposed mutual learning mechanism, evolutionary improvement mechanism, and dynamic loss weight are effective and necessary for MED to achieve overall optimal performance. In summary, our contributions are as follows:

  • We propose a new research question and task: training croppable KGE, from which KGEs of different dimensions can be cropped and used directly without any additional training.

  • We propose a novel framework MED, including a mutual learning mechanism, an evolutionary improvement mechanism, and a dynamic loss weight, to ensure the overall performance of all sub-models during training the croppable KGE.

  • We experimentally prove that all sub-models of MED work well, especially the performance of the low-dimensional sub-models exceeding the KGE with the same dimension trained by the state-of-the-art distillation-based methods. MED also shows excellent performance in real-world applications and good extensibility on other types of neural networks.

2 Related Work

This work is to achieve a croppable KGE that meets different dimensional requirements. One of the most common methods to obtain a good-performance KGE of the target dimension is utilizing knowledge distillation with a high-dimensional powerful teacher KGE. Thus, we focus on two research fields most relevant to our work: knowledge graph embedding and knowledge distillation.

2.1 Knowledge Graph Embedding

Knowledge graph embedding (KGE) technology has been widely applied with the key idea of mapping entities and relations of a KG into continuous vector spaces as vector representations, which can further serve various KG downstream tasks. TransE DBLP:conf/nips/BordesUGWY13 is the most representative translation-based KGE method, regarding the relation as a translation from the head to the tail entity. Variants of TransE include TransH DBLP:conf/aaai/WangZFC14 , TransR DBLP:conf/aaai/LinLSLZ15 , TransD DBLP:conf/acl/JiHXL015 and so on. RESCAL DBLP:conf/icml/NickelTK11 is the first one based on vector decomposition, and then to improve it, DistMult DBLP:journals/corr/YangYHGD14a , ComplEx DBLP:conf/icml/TrouillonWRGB16 , and SimplE DBLP:conf/nips/Kazemi018 are proposed. RotatE DBLP:conf/iclr/SunDNT19 is a typical rotation-based method that regards the relation as the rotation between the head and tail entities. QuatE DBLP:conf/nips/0007TYL19 and DihEdral DBLP:conf/acl/XuL19 work with a similar idea. PairRE DBLP:conf/acl/ChaoHWC20 uses two relation vectors to project the head and tail entities into a Euclidean space to encode complex relational patterns. With the development of neural networks, KGEs based on graph neural networks (GNNs) DBLP:conf/aaai/DettmersMS018 ; DBLP:conf/naacl/NguyenNNP18 ; DBLP:conf/esws/SchlichtkrullKB18 ; DBLP:conf/iclr/VashishthSNT20 are also proposed. Although the KGEs are simple and effective, there is an obvious challenge: in different scenarios, the required KGE dimensions are different, depending on the storage and computing resources of the device. A new KGE model has to be trained from scratch for each new dimension requirement, which greatly increases the training cost and limits the flexibility for KGE to serve diversified scenarios.

2.2 Knowledge Distillation

High-dimensional KGEs have strong expression ability due to the large number of parameters, but require a lot of storage and computing resources, and are not suitable for all scenarios, especially small devices. To solve this problem, a common way is to compress a high-dimensional KGE to the target low-dimensional KGE by knowledge distillation DBLP:journals/corr/HintonVD15 ; DBLP:conf/aaai/MirzadehFLLMG20 and quantization DBLP:conf/acl/BaiZHSJJLLK20 ; DBLP:conf/iclr/StockFGGGJJ21 technology.

Quantization replaces continuous vector representations with lower-dimensional discrete codes. TS-CL DBLP:conf/acl/Sachan20 is the first work of KGE compression applying quantization. LightKG DBLP:conf/cikm/WangWLG21 uses a residual module to induce diversity among codebooks. However, quantization cannot improve the inference speed so it’s still not suitable for devices with limited computing resources.

Knowledge distillation (KD) has been widely used in Computer Vision DBLP:conf/aaai/MirzadehFLLMG20 and Natural Language Processing DBLP:conf/naacl/DevlinCLT19 ; DBLP:conf/emnlp/SunCGL19 , helping reduce the model size and increase the inference speed. The core idea is to use the output of a large teacher model to guide the training of a small student model. DualDE DBLP:conf/wsdm/ZhuZCCC0C22 is a representative KD-based work to transfer the knowledge of high-dimensional KGE to low-dimensional KGE. It considers the mutual influences between the teacher and student and finetunes the teacher during training. MulDE DBLP:conf/www/Wang0MS21 transfers the knowledge from multiple low-dimensional teacher models to a student model for hyperbolic KGE. ISD DBLP:journals/corr/abs-2206-02963 improves low-dimensional KGE by making it play the teacher and student roles alternately during training. Among these methods, DualDE DBLP:conf/wsdm/ZhuZCCC0C22 is the most relevant to our work, since both use the setting of a high-dimensional teacher and a low-dimensional student model. In this work, we propose a novel KD-based KGE training framework MED, with which a single training run yields a croppable KGE that meets multiple dimensional requirements.

3 Preliminary

Table 1: Score functions.
KGE method                            Scoring function $f(\mathbf{h},\mathbf{r},\mathbf{t})$
TransE DBLP:conf/nips/BordesUGWY13    $-\left\|\mathbf{h}+\mathbf{r}-\mathbf{t}\right\|$
RotatE DBLP:conf/iclr/SunDNT19        $-\left\|\mathbf{h}\circ\mathbf{r}-\mathbf{t}\right\|$
PairRE DBLP:conf/acl/ChaoHWC20        $-\left\|\mathbf{h}\circ\mathbf{r}^{H}-\mathbf{t}\circ\mathbf{r}^{T}\right\|$

Knowledge graph embedding (KGE) methods aim to express the relations between entities in a continuous vector space through a scoring function $f$. Specifically, given a knowledge graph $\mathcal{G}=(\mathcal{E},\mathcal{R},\mathcal{T})$ where $\mathcal{E}$, $\mathcal{R}$ and $\mathcal{T}$ are the sets of entities, relations and all observed triples, we utilize the triple scoring function to measure the plausibility of triples in the embedding space for a triple $(h,r,t)$ where $h\in\mathcal{E}$, $r\in\mathcal{R}$ and $t\in\mathcal{E}$. The triple score function is denoted as $s_{(h,r,t)}=f(\mathbf{h},\mathbf{r},\mathbf{t})$ with embeddings of head entity $h$, relation $r$ and tail entity $t$ as input. Table 1 summarizes the scoring functions of some popular KGE methods, where $\circ$ is the Hadamard product. The higher the triple score, the more likely the model is to judge the triples as true.
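As an illustration of the scoring functions in Table 1, the following minimal sketch evaluates them on plain Python lists. It rests on several assumptions that the text does not fix: the norm is taken to be the L1 norm, RotatE relations are parameterized by phase angles acting on complex-valued entity embeddings, and all function names are hypothetical.

```python
import cmath


def l1_norm(values):
    """Sum of absolute values (works for real and complex entries)."""
    return sum(abs(v) for v in values)


def transe_score(h, r, t):
    # -||h + r - t||, with the L1 norm assumed for illustration.
    return -l1_norm([hi + ri - ti for hi, ri, ti in zip(h, r, t)])


def rotate_score(h, r_phase, t):
    # -||h o r - t|| with complex h, t and unit-modulus r = exp(i * phase).
    rotated = [hi * cmath.exp(1j * p) for hi, p in zip(h, r_phase)]
    return -l1_norm([ri - ti for ri, ti in zip(rotated, t)])


def pairre_score(h, r_head, r_tail, t):
    # -||h o r^H - t o r^T|| with two relation vectors.
    return -l1_norm([hi * rh - ti * rt
                     for hi, rh, ti, rt in zip(h, r_head, t, r_tail)])


# A triple that fits the translation exactly gets the best possible score, 0.
perfect = transe_score([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
# A poorly fitting tail gets a lower (more negative) score.
poor = transe_score([1.0, 0.0], [0.0, 1.0], [0.0, 0.0])
```

Here `perfect` is higher than `poor`, matching the convention that a higher score marks a more plausible triple.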

4 MED Framework

Figure 2: Overview of MED.

As shown in Fig. 2, our croppable KGE framework MED contains multiple (let’s say $n$) sub-models of different dimensions in it, denoted as $M_i$ $(i=1,2,\dots,n)$ with dimension of $d_i$. Each sub-model $M_i$ is composed of the first $d_i$ dimensions of the whole embedding and the score of triple $(h,r,t)$ output by $M_i$ is $s_{(h,r,t)}^{i}=f(\mathbf{h}[0:d_i],\mathbf{r}[0:d_i],\mathbf{t}[0:d_i])$, where $\mathbf{h}[0:d_i]$ represents the first $d_i$ elements of vector $\mathbf{h}$. The parameters of sub-model $M_i$ are shared by all sub-models $M_j$ $(i<j\leqslant n)$ that are higher-dimensional than it. The number of sub-models $n$ and the specific dimension of each sub-model $d_i$ can be set according to the actual application needs. For low-dimensional sub-models, we want to improve their performance as much as possible. For high-dimensional sub-models, we hope they cover the abilities that low-dimensional sub-models already have and master the knowledge that low-dimensional sub-models can not learn well, that is, they need to correctly predict not only the triples that low-dimensional sub-models can predict correctly but also those low-dimensional sub-models predict wrongly.
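The cropping operation can be sketched in a few lines. The example below assumes a TransE-style score with an L1 norm and toy 4-dimensional embeddings; the helper names are hypothetical and the values are chosen only to make the arithmetic easy to follow.

```python
def transe_score(h, r, t):
    # -||h + r - t|| with an L1 norm (an assumption for this sketch).
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


def sub_model_score(h, r, t, dim):
    """Score of sub-model M_i: apply f to the first d_i elements only."""
    return transe_score(h[:dim], r[:dim], t[:dim])


def all_sub_model_scores(h, r, t, dims):
    """Scores of every sub-model; all of them read the same shared vectors."""
    return [sub_model_score(h, r, t, d) for d in dims]


# One shared embedding per entity/relation; sub-models are prefixes of it.
h = [1.0, 2.0, 3.0, 4.0]
r = [1.0, 1.0, 1.0, 1.0]
t = [2.0, 3.0, 0.0, 0.0]
scores = all_sub_model_scores(h, r, t, dims=[2, 4])
```

Because every sub-model is a prefix slice of the same vectors, deploying a lower-dimensional model amounts to truncating the stored embeddings, with no separate parameters to keep.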

MED is based on the knowledge distillation technique DBLP:journals/corr/HintonVD15 ; DBLP:journals/corr/abs-1903-12136 ; DBLP:conf/naacl/DevlinCLT19 , in which the student learns by fitting the hard (ground-truth) label and the soft label from the teacher simultaneously. In MED, we first propose a mutual learning mechanism that makes low-dimensional sub-models learn from high-dimensional sub-models to achieve better performance, and makes high-dimensional sub-models also learn from low-dimensional sub-models to retain the abilities that low-dimensional sub-models already have. Then, we propose an evolutionary improvement mechanism to enable high-dimensional sub-models to master the knowledge that the low-dimensional sub-models cannot learn well. Finally, we train MED with a dynamic loss weight to adaptively balance the multiple optimization objectives of the sub-models.

4.1 Mutual Learning Mechanism

We treat each sub-model $M_i$ as the student of its higher-dimensional neighbor sub-model $M_{i+1}$ to achieve better performance, since high-dimensional KGEs usually have more expressive power than low-dimensional ones due to more parameters DBLP:conf/acl/Sachan20 ; DBLP:conf/wsdm/ZhuZCCC0C22 . We also treat sub-model $M_i$ as the student of its lower-dimensional neighbor sub-model $M_{i-1}$, so the higher-dimensional sub-model can review what the lower-dimensional sub-model has learned and retain the low-dimensional one’s existing abilities. Thus, pairwise neighbor sub-models serve as both teachers and students, learning from each other. The mutual learning loss between each pair of neighbor sub-models is

$$L_{ML}^{i-1,i}=\sum_{(h,r,t)\in\mathcal{T}\cup\mathcal{T}^{-}}d_{\delta}\left(s_{(h,r,t)}^{i-1},s_{(h,r,t)}^{i}\right),\quad 1<i\leqslant n, \tag{1}$$

where $s_{(h,r,t)}^{i}$ is the score of triple $(h,r,t)$ output by sub-model $M_i$ and reflects the possibility that this triplet exists, $\mathcal{T}^{-}=\mathcal{E}\times\mathcal{R}\times\mathcal{E}\setminus\mathcal{T}$ is the negative triple set, $n$ is the number of sub-models, and $d_{\delta}$ is Huber loss Huber1964Robust with $\delta=1$ commonly used in knowledge distillation for KGE DBLP:conf/wsdm/ZhuZCCC0C22 . MED makes each sub-model only learn from its neighbor sub-models. The advantage is that this not only reduces the computational complexity of training but also makes every pair of teacher and student models have a relatively small dimension gap, which is important and effective because the large gap of dimensions between teacher and student will destroy the distillation effect DBLP:conf/aaai/MirzadehFLLMG20 ; DBLP:conf/wsdm/ZhuZCCC0C22 .
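To make Eq. (1) concrete, here is a minimal sketch of the mutual learning loss for one pair of neighbor sub-models. It assumes the scores of both sub-models are already available for the same list of triples (in practice a batch of positive and sampled negative triples rather than all of $\mathcal{T}\cup\mathcal{T}^{-}$), and it only computes the loss value; how gradients flow to each sub-model is not covered. The function names are hypothetical.

```python
def huber(a, b, delta=1.0):
    """Huber distance d_delta(a, b): quadratic near zero, linear beyond delta."""
    diff = abs(a - b)
    if diff <= delta:
        return 0.5 * diff * diff
    return delta * (diff - 0.5 * delta)


def mutual_learning_loss(scores_low, scores_high, delta=1.0):
    """Eq. (1): sum of Huber distances between the scores of M_{i-1} and M_i."""
    if len(scores_low) != len(scores_high):
        raise ValueError("both sub-models must score the same triples")
    return sum(huber(a, b, delta) for a, b in zip(scores_low, scores_high))


# Two triples: the sub-models nearly agree on the first, disagree on the second.
loss = mutual_learning_loss([0.0, -3.0], [0.5, -1.0])
```

The loss is symmetric in its two arguments, which reflects the fact that each sub-model of the pair acts as both teacher and student.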

4.2 Evolutionary Improvement Mechanism

The hard (ground-truth) label is the other important supervision signal during training in knowledge distillation DBLP:journals/corr/HintonVD15 . High-dimensional sub-models need to master triples that low-dimensional sub-models can not learn well, that is, high-dimensional sub-models need to correctly predict those positive (negative) triples that are wrongly predicted to be negative (positive) by low-dimensional sub-models. In MED, for a given triple $(h,r,t)$, the optimization weight in sub-model $M_i$ for it depends on the triple score output by the previous sub-model $M_{i-1}$.

For a positive triple, the optimization weight of the model $M_i$ for it is negatively correlated with its score by the model $M_{i-1}$. Specifically, the higher its score from the model $M_{i-1}$ (meaning that $M_{i-1}$ has been able to correctly judge it as a positive sample), the lower the optimization weight of the model $M_i$ for it, and the lower its score from the model $M_{i-1}$ (meaning that $M_{i-1}$ wrongly judges it as a negative sample), the higher the optimization weight of the model $M_i$ for it because $M_{i-1}$ cannot predict this triple well. The optimization weight of $M_i$ for the positive triple is

$$pos_{h,r,t}^{i}=\begin{cases}\dfrac{\exp\left(w_{1}/s_{(h,r,t)}^{i-1}\right)}{\sum_{(h,r,t)\in T_{batch}}\exp\left(w_{1}/s_{(h,r,t)}^{i-1}\right)} & \text{if } 1<i\leqslant n,\\[2ex] \dfrac{1}{|T_{batch}|} & \text{if } i=1,\end{cases} \tag{2}$$

where $s_{(h,r,t)}^{i-1}$ is the score for triple $(h,r,t)$ output by the sub-model $M_{i-1}$, $T_{batch}$ is the set of positive triples within a batch, and $w_{1}$ is a learnable scaling parameter. Conversely, for a negative triple, the optimization weight of the model $M_i$ for it is positively correlated with its score by the model $M_{i-1}$. The optimization weight of $M_i$ for the negative triple is

$$neg_{h,r,t}^{i}=\begin{cases}\dfrac{\exp\left(w_{2}\cdot s_{(h,r,t)}^{i-1}\right)}{\sum_{(h,r,t)\in T_{batch}^{-}}\exp\left(w_{2}\cdot s_{(h,r,t)}^{i-1}\right)} & \text{if } 1<i\leqslant n,\\[2ex] \dfrac{1}{|T_{batch}^{-}|} & \text{if } i=1,\end{cases} \tag{3}$$

where $T_{batch}^{-}$ is the set of negative triples within a batch, and $w_{2}$ is a learnable scaling parameter.
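The two weighting rules can be illustrated with a small sketch. It reads the exponent in Eq. (2) as $w_1$ divided by the previous score and the exponent in Eq. (3) as $w_2$ times the previous score, and it assumes that all previous scores are non-zero (for example, strictly negative distance-based scores). The scaling parameters are treated as fixed numbers here rather than learned, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Normalise exp(logit) over the batch, shifting by the max for stability."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def positive_weights(prev_scores, w1=1.0):
    # Eq. (2), case i > 1: a lower score from M_{i-1} gives a larger weight.
    # Assumes every score is non-zero.
    return softmax([w1 / s for s in prev_scores])


def negative_weights(prev_scores, w2=1.0):
    # Eq. (3), case i > 1: a higher score from M_{i-1} gives a larger weight.
    return softmax([w2 * s for s in prev_scores])


def uniform_weights(batch_size):
    # Case i = 1: the lowest-dimensional sub-model weights all triples equally.
    return [1.0 / batch_size] * batch_size


prev_scores = [-1.0, -4.0]  # scores of two triples under M_{i-1}
pos_w = positive_weights(prev_scores)
neg_w = negative_weights(prev_scores)
```

With these toy scores, the positive triple scored -4.0 (poorly predicted) receives the larger positive weight, while the negative triple scored -1.0 (wrongly rated as plausible) receives the larger negative weight.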

Therefore, the evolutionary improvement loss of the sub-model $M_i$ is

$$L_{EI}^{i}=-\sum_{(h,r,t)\in\mathcal{T}\cup\mathcal{T}^{-}}pos_{h,r,t}^{i}\cdot y\log\sigma(s_{(h,r,t)}^{i})+neg_{h,r,t}^{i}\cdot(1-y)\log(1-\sigma(s_{(h,r,t)}^{i})), \tag{4}$$

where $\sigma$ is the Sigmoid activation function, $y$ is the ground-truth label of the triple $(h,r,t)$, and it is $1$ for positive triples and $0$ for negative ones. In each sub-model, different hard (ground-truth) label loss weights are set for different triples, and the high-dimensional sub-model will pay more attention to learn the triple that the low-dimensional sub-model can not learn well.
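A minimal sketch of Eq. (4) follows, written as a weighted binary cross-entropy over one batch. It assumes the positive and negative triples are passed separately (so the label $y$ is implicit) and that the per-triple weights have already been computed; the function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def evolutionary_improvement_loss(pos_scores, pos_weights,
                                  neg_scores, neg_weights):
    """Eq. (4): weighted cross-entropy of sub-model M_i on hard labels."""
    # Positive triples (y = 1) contribute pos * log(sigmoid(s)).
    pos_term = sum(w * log_sigmoid(s)
                   for w, s in zip(pos_weights, pos_scores))
    # Negative triples (y = 0) contribute neg * log(1 - sigmoid(s)),
    # and log(1 - sigmoid(s)) equals log(sigmoid(-s)).
    neg_term = sum(w * log_sigmoid(-s)
                   for w, s in zip(neg_weights, neg_scores))
    return -(pos_term + neg_term)


# One positive and one negative triple, both scored 0 (sigmoid = 0.5).
loss = evolutionary_improvement_loss([0.0], [1.0], [0.0], [1.0])
```

Shifting weight towards a triple that $M_i$ currently scores badly raises the loss, which is what pushes the higher-dimensional sub-model to focus on the triples its lower-dimensional neighbor got wrong.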

4.3 Dynamic Loss Weight

Since MED involves the optimization of multiple sub-models, we set dynamic loss weights during training. Initially, low-dimensional sub-models prioritize learning from high-dimensional sub-models to improve performance. This means low-dimensional sub-models rely more on soft label information, so for low-dimensional sub-models, evolutionary improvement loss should account for less than mutual learning loss. Conversely, high-dimensional sub-models should focus more on capturing knowledge that low-dimensional models lack, while mitigating the impact of low-quality outputs from low-dimensional models to maintain their good performance, that is, high-dimensional sub-models rely more on hard label information. So for high-dimensional sub-models, evolutionary improvement loss should account for more than mutual learning loss. For a teacher-student pair, their mutual learning loss acts on both teacher and student models simultaneously, so the effect of mutual learning loss for them is theoretically the same. We set different evolutionary improvement loss weights for different sub-models, and the final training loss function of MED is

$$L=\sum_{i=2}^{n}L_{ML}^{i-1,i}+\sum_{i=1}^{n}\exp\left(\frac{w_{3}\cdot d_{i}}{d_{n}}\right)\cdot L_{EI}^{i}, \tag{5}$$

where $w_{3}$ is a learnable scaling parameter, and $d_{i}$ is the dimension of the $i$th sub-model.
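The combination in Eq. (5) can be sketched as follows, assuming the per-pair mutual learning losses and per-sub-model evolutionary improvement losses have already been computed as plain numbers. The function name and the toy values are hypothetical, and $w_3$ is treated as a fixed number rather than a learned parameter.

```python
import math


def total_loss(ml_losses, ei_losses, dims, w3=1.0):
    """Eq. (5): sum of ML losses plus dimension-weighted EI losses.

    ml_losses[k] is the loss between sub-models k+1 and k+2 (n - 1 entries);
    ei_losses[k] and dims[k] belong to sub-model k+1 (n entries each);
    dims is sorted in increasing order, so dims[-1] is d_n.
    """
    if len(ei_losses) != len(dims) or len(ml_losses) != len(dims) - 1:
        raise ValueError("expected n EI losses, n dims and n - 1 ML losses")
    d_n = dims[-1]
    ei_total = sum(math.exp(w3 * d / d_n) * loss
                   for d, loss in zip(dims, ei_losses))
    return sum(ml_losses) + ei_total


# Two sub-models of dimension 10 and 20 with equal EI losses.
loss = total_loss(ml_losses=[0.5], ei_losses=[1.0, 1.0], dims=[10, 20])
```

For a positive $w_3$, the factor $\exp(w_3 d_i / d_n)$ grows with $d_i$, so the hard-label loss counts for more in higher-dimensional sub-models, in line with the motivation given above.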

5 Experiment

We evaluate MED on typical KGE and GLUE benchmarks and particularly answer the following research questions: (RQ1) Can MED train, in a single run, a croppable KGE from which multiple sub-models of different dimensions can be cropped, all achieving promising performance? (RQ2) Can MED finally achieve parameter-efficient KGE models? (RQ3) Does MED work in real-world applications? (RQ4) Can MED be extended to other neural networks besides KGE?

5.1 Experiment Setting

5.1.1 Dataset and KGE methods

MED is universal and can be applied to any KGE method with a triple score function. We select three commonly used KGE methods as examples: TransE DBLP:conf/nips/BordesUGWY13 , RotatE DBLP:conf/iclr/SunDNT19 and PairRE DBLP:conf/acl/ChaoHWC20 , whose triple score functions are described in Table 1.

Table 2: Statistics of datasets.
Dataset #Ent. #Rel. #Train #Valid #Test
WN18RR 40,943 11 86,835 3,034 3,134
FB15K237 14,541 237 272,115 17,535 20,466
CoDEx-L 77,951 69 551,193 30,622 30,622
YAGO3-10 123,143 37 1,079,040 4,978 4,982
SKG 6,974,959 15 50,775,620 - -

We conduct comparison experiments on two common KG completion benchmark datasets, WN18RR DBLP:conf/emnlp/ToutanovaCPPCG15 and FB15K237 DBLP:conf/aaai/DettmersMS018 , and two larger-scale KGs, CoDEx-L DBLP:conf/emnlp/SafaviK20 and YAGO3-10 DBLP:conf/cidr/MahdisoltaniBS15 . In addition, we apply MED in real application scenarios to a real-world large-scale e-commerce social knowledge graph (SKG) involving more than 50 million triples of social records from about 7 million users on the Taobao platform. Table 2 shows the statistics of the datasets.

5.1.2 Evaluation Metric

For the link prediction task, we adopt standard metrics MRR and Hit@$k$ ($k=1,3,10$) in the filtered setting DBLP:conf/nips/BordesUGWY13 . We use Effi DBLP:conf/aaai/ChenZYZGPC23 , that is MRR/#P (#P is the number of parameters), to quantify the parameter efficiency of models. We use f1-score and accuracy for the user labeling task, and normalized discounted cumulative gain ndcg@$k$ ($k=5,10$) for the product recommendation task.
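The link prediction metrics above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the experiments: the helper names are hypothetical, scores are assumed to be "higher is better", and ties are resolved optimistically (only strictly better candidates push the target down).

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates in the filtered setting.

    scores: dict mapping candidate entity -> score (higher is better).
    known_true: set of other candidates that also form true triples;
    they are ignored so that they cannot push the target down.
    """
    target_score = scores[target]
    better = sum(
        1
        for cand, s in scores.items()
        if cand != target and cand not in known_true and s > target_score
    )
    return better + 1


def mrr(ranks):
    """Mean reciprocal rank over a list of ranks (1 is best)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hit_at_k(ranks, k):
    """Fraction of test cases whose rank is at most k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
rank_c = filtered_rank(scores, target="c", known_true={"a"})
ranks = [1, 2, 4]
```

In the toy example, candidate "a" is a known true answer and is filtered out, so "c" is ranked second rather than third.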

5.1.3 Implementation

For the link prediction task, we set $d_n=640$ for the highest-dimensional sub-model $M_n$ and $d_1=10$ for the lowest-dimensional sub-model $M_1$. We set $n=64$ and the same dimension gap of 10 between every pair of neighboring sub-models, so that there are a total of 64 available sub-models of different dimensions from 10 to 640 in our croppable KGE model. The dimension of sub-model $M_i$ ($i=1,2,\dots,64$) is $10\times i$. For the user labeling and product recommendation tasks, we set $n=3$ and train the croppable KGE containing 3 sub-models: $M_1$ with $d_1=10$ for mobile phone (MP) terminals that are limited by storage and computing resources, $M_2$ with $d_2=100$ for the personal computer (PC), and $M_3$ with $d_3=500$ for the platform’s servers. We initialize the learnable scaling parameters $w_1$, $w_2$ and $w_3$ in (2), (3) and (5) to 1.
We implement MED by extending OpenKE DBLP:conf/emnlp/HanCLLLSL18 , an open-source KGE framework based on PyTorch. We set the batch size to 1024 and the maximum number of training epochs to 3000 with early stopping. For each positive triple, we generate 64 negative triples by randomly replacing its head or tail entity with another entity. We use the Adam DBLP:journals/corr/KingmaB14 optimizer with a linear decay learning rate scheduler and search the initial learning rate in $\{0.0001, 0.0005, 0.001, 0.01\}$. We train all sub-models simultaneously by optimizing sub-models uniformly sampled from the full croppable model in each step.
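The negative sampling and sub-model sampling described above might look roughly like the sketch below. It is not taken from OpenKE or the MED implementation: the function names are hypothetical, entities are assumed to be integer ids, the 50/50 choice between head and tail is an assumption, and the number of sub-models sampled per step (`num_sampled`) is a made-up parameter because the text does not state it.

```python
import random


def corrupt_triple(triple, num_entities, num_negatives=64, rng=random):
    """Build negatives by replacing the head or the tail with another entity id."""
    if num_entities < 2:
        raise ValueError("need at least two entities to corrupt a triple")
    h, r, t = triple
    negatives = []
    while len(negatives) < num_negatives:
        e = rng.randrange(num_entities)
        if rng.random() < 0.5:
            if e != h:  # replace the head, keep relation and tail
                negatives.append((e, r, t))
        elif e != t:  # replace the tail, keep head and relation
            negatives.append((h, r, e))
    return negatives


def sample_sub_model_dims(n=64, gap=10, num_sampled=4, rng=random):
    """Uniformly sample distinct sub-model indices; return d_i = gap * i."""
    indices = rng.sample(range(1, n + 1), num_sampled)
    return sorted(gap * i for i in indices)


rng = random.Random(0)
negatives = corrupt_triple((1, 0, 2), num_entities=100, rng=rng)
dims = sample_sub_model_dims(rng=rng)
```

Note that this sketch does not check whether a corrupted triple happens to be a true triple in the KG; whether such filtering is applied during training is not specified in the text.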

5.1.4 Baselines

For each required dimension $d_r$, we extract the first $d_r$ dimensions from our croppable KGE as the target model and compare it to the KGE models obtained by 7 baselines of the following 3 types:

  • Directly training the target KGE model of the required dimension $d_r$, referred to as 1) DT. The directly trained highest-dimensional KGE model ($d_r=d_n$) is marked as $M_{max}^{DT}$.

  • Extracting the first $d_r$ dimensions from $M_{max}^{DT}$ as the target model, referred to as 2) Ext. In addition, following DBLP:conf/iclr/MolchanovTKAK17 ; DBLP:conf/acl/VoitaTMST19 , we reorder the 640 dimensions of $M_{max}^{DT}$ in descending order of importance before extracting: 3) Ext-L, where the importance of each dimension of $M_{max}^{DT}$ is the variation of the KGE loss on the validation set after removing it; and 4) Ext-V, where the importance of each dimension is the average absolute value of its parameter weights over all entities and all relations.

  • Distilling the target KGE by KD methods: 5) BKD DBLP:journals/corr/HintonVD15 is the most basic one, minimizing the KL divergence between the output distributions of the teacher and student; 6) TA DBLP:conf/aaai/MirzadehFLLMG20 uses a medium-size teaching assistant (TA) model as a bridge for the size gap, where the TA model has the same dimension as the directly trained one whose MRR is closest to the average MRR of the teacher and student; and 7) DualDE DBLP:conf/wsdm/ZhuZCCC0C22 compresses KGE by optimizing the teacher and student simultaneously. We do not compare with MulDE DBLP:conf/www/Wang0MS21 , which uses multiple different low-dimensional KGE models as teachers to aggregate their knowledge into one model rather than compressing a high-dimensional KGE. In these baselines, $M_{max}^{DT}$ is the teacher, and other settings including hyperparameters are the same as in their original papers.
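The extraction-based baselines above can be sketched in a few lines. The code below is only an illustration under simplifying assumptions: embeddings are plain Python lists, entity and relation vectors are assumed to share the same dimensionality (as in TransE), and the helper names `crop`, `ext_v_order` and `reorder` are hypothetical.

```python
def crop(embedding, d_r):
    """Keep the first d_r dimensions of one embedding vector (Ext-style cropping)."""
    return embedding[:d_r]


def ext_v_order(entity_emb, relation_emb):
    """Ext-V style ordering: dimensions sorted by mean absolute weight, descending.

    The mean is taken over all entity and relation vectors, which are
    assumed here to have the same number of dimensions.
    """
    rows = entity_emb + relation_emb
    dim = len(rows[0])
    importance = [sum(abs(row[j]) for row in rows) / len(rows) for j in range(dim)]
    return sorted(range(dim), key=lambda j: importance[j], reverse=True)


def reorder(embedding, order):
    """Permute the dimensions of one vector according to `order`."""
    return [embedding[j] for j in order]


entity_emb = [[0.1, -2.0, 0.5], [0.2, 1.0, -0.5]]
relation_emb = [[0.0, 3.0, 1.0]]
order = ext_v_order(entity_emb, relation_emb)
cropped = crop(reorder(entity_emb[0], order), 2)
```

Plain Ext corresponds to calling `crop` directly on the trained vectors, while Ext-V first applies the same permutation to every vector so that the most important dimensions come first.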

5.2 Performance Comparison

We report the link prediction results of some representative dimensions in Table 3; more results for other dimensions and metrics are in Appendix A, and the ablation studies are in Appendix B.

Table 3: MRR and Hit@10 (H10) of some dimensions on WN18RR (WN) and FB15K237 (FB).
WN18RR FB15K237
10d 40d 160d 640d 10d 40d 160d 640d
KGE Method MRR H10 MRR H10 MRR H10 MRR H10 MRR H10 MRR H10 MRR H10 MRR H10
TransE DT 0.121 0.287 0.214 0.496 0.233 0.531 0.237 0.537 0.150 0.235 0.299 0.477 0.315 0.499 0.322 0.508
Ext 0.125 0.298 0.199 0.468 0.225 0.515 0.237 0.537 0.115 0.211 0.236 0.392 0.286 0.462 0.322 0.508
Ext-L 0.139 0.315 0.224 0.497 0.236 0.534 0.237 0.537 0.109 0.194 0.232 0.381 0.285 0.462 0.322 0.508
Ext-V 0.139 0.309 0.222 0.494 0.236 0.532 0.237 0.537 0.139 0.256 0.237 0.396 0.293 0.466 0.322 0.508
BKD 0.141 0.323 0.226 0.513 0.233 0.531 - - 0.176 0.293 0.303 0.480 0.315 0.501 - -
TA 0.144 0.335 0.226 0.512 0.234 0.533 - - 0.175 0.246 0.303 0.484 0.319 0.504 - -
DualDE 0.148 0.337 0.225 0.514 0.235 0.533 - - 0.179 0.301 0.306 0.483 0.319 0.505 - -
MED 0.170 0.388 0.232 0.518 0.236 0.529 0.237 0.537 0.196 0.341 0.308 0.486 0.320 0.505 0.322 0.507
RotatE DT 0.172 0.418 0.456 0.556 0.471 0.567 0.476 0.575 0.254 0.424 0.312 0.495 0.322 0.506 0.325 0.515
Ext 0.299 0.378 0.437 0.516 0.467 0.549 0.476 0.575 0.138 0.245 0.251 0.410 0.291 0.465 0.325 0.515
Ext-L 0.206 0.277 0.399 0.487 0.445 0.541 0.476 0.575 0.135 0.243 0.221 0.365 0.280 0.453 0.325 0.515
Ext-V 0.261 0.377 0.337 0.471 0.416 0.532 0.476 0.575 0.160 0.281 0.238 0.393 0.288 0.458 0.325 0.515
BKD 0.175 0.434 0.457 0.556 0.472 0.570 - - 0.277 0.442 0.314 0.503 0.322 0.510 - -
TA 0.177 0.438 0.459 0.558 0.473 0.572 - - 0.280 0.447 0.313 0.501 0.323 0.510 - -
DualDE 0.179 0.440 0.462 0.559 0.473 0.573 - - 0.282 0.449 0.315 0.502 0.322 0.512 - -
MED 0.324 0.469 0.466 0.561 0.471 0.574 0.476 0.574 0.288 0.459 0.318 0.504 0.323 0.510 0.324 0.514
PairRE DT 0.220 0.321 0.415 0.472 0.449 0.534 0.453 0.544 0.182 0.314 0.284 0.452 0.319 0.505 0.332 0.522
Ext 0.152 0.209 0.334 0.463 0.419 0.526 0.453 0.544 0.148 0.222 0.217 0.353 0.294 0.469 0.332 0.522
Ext-L 0.162 0.220 0.363 0.442 0.437 0.523 0.453 0.544 0.150 0.249 0.219 0.333 0.309 0.489 0.332 0.522
Ext-V 0.172 0.260 0.389 0.456 0.441 0.529 0.453 0.544 0.176 0.277 0.229 0.374 0.311 0.490 0.332 0.522
BKD 0.228 0.336 0.421 0.483 0.451 0.536 - - 0.198 0.332 0.288 0.453 0.321 0.508 - -
TA 0.245 0.340 0.426 0.487 0.452 0.537 - - 0.208 0.346 0.292 0.455 0.323 0.509 - -
DualDE 0.242 0.336 0.428 0.495 0.453 0.540 - - 0.207 0.342 0.293 0.456 0.326 0.512 - -
MED 0.317 0.376 0.433 0.502 0.451 0.541 0.451 0.542 0.239 0.384 0.303 0.466 0.324 0.510 0.330 0.520

MED outperforms baselines in almost all settings, especially for the extremely low dimensions. On WN18RR with $d$=10, MED achieves an improvement of 14.9% and 15.1% on TransE, 8.4% and 6.6% on RotatE, 29.4% and 10.6% on PairRE compared with the best MRR and Hit@10 of baselines. We can observe a similar phenomenon on FB15K237. This benefits from the rich knowledge sources of low-dimensional models in MED: For sub-model $M_i$, $M_{i+1}$ is the teacher directly next to it, while $M_{i+2}$ can also indirectly affect $M_i$ by directly affecting $M_{i+1}$. Theoretically, all higher-dimensional sub-models can finally transfer their knowledge to low-dimensional sub-models through stepwise propagation. Although such stepwise propagation may have negative effects on high-dimensional models by bringing low-quality knowledge from low-dimensional sub-models, the evolutionary improvement mechanism in MED weakens the damage and lets high-dimensional ones still achieve performance competitive with directly trained KGEs, as in Fig. 3. We also find that Ext-based methods are highly unstable: Ext, Ext-L, and Ext-V work worse than DT in most settings (TransE on WN18RR being a notable exception), indicating that only considering the importance of each dimension is not enough to guarantee the performance of all sub-models. More results and ablation studies are in Appendix A and Appendix B.

Figure 3: Results of different dimensions for PairRE on WN18RR (left) and FB15K237 (right).
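The relative improvements quoted for $d$=10 on WN18RR can be re-derived from Table 3. The short check below copies the MED value and the best baseline value for each KGE method from the table and computes (MED − best baseline) / best baseline; the dictionary layout and function name are only for illustration.

```python
def relative_improvement(ours, best_baseline):
    """Relative gain of `ours` over `best_baseline`, in percent."""
    return 100.0 * (ours - best_baseline) / best_baseline


# (MED value, best baseline value) at d = 10 on WN18RR, copied from Table 3.
wn18rr_10d = {
    "TransE": {"MRR": (0.170, 0.148), "H10": (0.388, 0.337)},
    "RotatE": {"MRR": (0.324, 0.299), "H10": (0.469, 0.440)},
    "PairRE": {"MRR": (0.317, 0.245), "H10": (0.376, 0.340)},
}

gains = {
    method: {name: round(relative_improvement(*pair), 1) for name, pair in metrics.items()}
    for method, metrics in wn18rr_10d.items()
}
```

The resulting values match the percentages stated in the text (for example, 14.9% MRR and 15.1% Hit@10 for TransE).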

5.3 Parameter efficiency of MED

In Table 4, we compare our sub-models of suitable low dimensions to parameter-efficient KGEs specially designed for large-scale KGs, including NodePiece DBLP:conf/iclr/0001DWH22 and EARL DBLP:conf/aaai/ChenZYZGPC23 . When the number of model parameters is roughly equivalent, the performance of the sub-models of MED exceeds that of the specialized parameter-efficient KGE methods. This demonstrates that the sub-models of our method are parameter efficient. More importantly, MED can provide parameter-efficient models of different sizes for applications.

Table 4: Link prediction results on WN18RR, FB15K237, CoDEx-L and YAGO3-10.
FB15K237 WN18RR CoDEx-L YAGO3-10
Dim #P(M) MRR Hit@10 Effi Dim #P(M) MRR Hit@10 Effi Dim #P(M) MRR Hit@10 Effi Dim #P(M) MRR Hit@10 Effi
RotatE 1000 29.3 0.336 0.532 0.011 500 40.6 0.508 0.612 0.013 500 78 0.258 0.387 0.003 500 123.2 0.495 0.670 0.004
RotatE 100 2.9 0.296 0.473 0.102 50 4.1 0.411 0.429 0.100 25 3.8 0.196 0.322 0.052 20 4.8 0.121 0.262 0.025
+ NodePiece 100 3.2 0.256 0.420 0.080 100 4.4 0.403 0.515 0.092 100 3.6 0.190 0.313 0.053 100 4.1 0.247 0.488 0.060
+ EARL 150 1.8 0.310 0.501 0.172 200 3.8 0.440 0.527 0.116 100 2.1 0.238 0.390 0.113 100 3 0.302 0.498 0.101
+ MED 40 1.2 0.318 0.504 0.265 40 3.2 0.466 0.561 0.146 20 3.1 0.243 0.385 0.078 20 4.9 0.313 0.528 0.064
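The Effi column in Table 4 follows directly from its definition, MRR/#P, with #P expressed in millions as in the table header. The sketch below recomputes it for the FB15K237 rows using the MRR and #P(M) values copied from the table; the row labels and function name are illustrative.

```python
def effi(mrr_value, params_millions):
    """Parameter efficiency: MRR divided by the parameter count in millions."""
    return mrr_value / params_millions


# (MRR, #P in millions) for the FB15K237 columns of Table 4.
fb15k237_rows = {
    "RotatE (1000d)": (0.336, 29.3),
    "RotatE (100d)": (0.296, 2.9),
    "RotatE + NodePiece (100d)": (0.256, 3.2),
    "RotatE + EARL (150d)": (0.310, 1.8),
    "RotatE + MED (40d)": (0.318, 1.2),
}

effi_scores = {name: round(effi(m, p), 3) for name, (m, p) in fb15k237_rows.items()}
```

On this dataset the 40-dimensional MED sub-model has the highest Effi (0.265) among the listed rows.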

5.4 MED in real applications

We apply the trained croppable KGE with TransE on SKG to three real applications: the user labeling task on servers and the product recommendation task on PCs and mobile phones. Table 5 shows that our croppable user embeddings substantially exceed all baselines, including directly trained (DT) embeddings, the best baseline DualDE, and principal component analysis (PCA), a dimension reduction method common in industry, applied to $M_{max}^{DT}$. Notably, the excellent performance on the mobile phone task (which can only carry embeddings with a maximum dimension of 10, limited by storage and computing resources) demonstrates the enormous practical value of our approach. More application details are in Appendix C.

Table 5: Results on SKG.
User Labeling Product Recommendation
server (500d) PC terminal (100d) MP terminal (10d)
Method acc. f1 ndcg@5 ndcg@10 ndcg@5 ndcg@10
DT 0.889 0.874 0.411 0.441 0.344 0.361
PCA - - 0.417 0.447 0.392 0.418
DualDE - - 0.423 0.456 0.404 0.433
MED 0.893 0.879 0.431 0.465 0.422 0.451
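For readers unfamiliar with the ndcg@$k$ metric used for product recommendation in Table 5, a minimal sketch is given below. It assumes the common linear-gain formulation with a log2 position discount and that the ranked list contains all relevant items; the exact variant used in the experiments is not specified in the text, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the top-k items (linear gain, log2 discount)."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending relevance) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


perfect = ndcg_at_k([1, 0, 0], 5)
second_place = ndcg_at_k([0, 1, 0], 5)
no_hits = ndcg_at_k([0, 0, 0], 5)
```

A relevant product ranked first gives a score of 1, while the same product ranked second is discounted by 1/log2(3).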

5.5 Extend MED to Neural Networks

To verify the extensibility of our method to other neural networks, we take the language model BERT DBLP:conf/naacl/DevlinCLT19 as an example. To ensure the consistency of the experimental environment as much as possible, we uniformly adopt distillation methods implemented based on Hugging Face Transformers DBLP:conf/emnlp/WolfDSCDMCRLFDS20 as baselines. Following previous works DBLP:conf/emnlp/SunCGL19 ; DBLP:journals/corr/abs-1903-12136 ; DBLP:journals/eswa/JungKNK23 ; DBLP:conf/acl/ZhouXM22 , we do not use pre-training distillation settings and only distill at the fine-tuning stage. More experimental details are in Appendix D.

Table 6: Results on the dev set of GLUE. The results of the knowledge distillation methods for BERT4 and BERT6 are those reported by DBLP:journals/eswa/JungKNK23 ; DBLP:conf/acl/ZhouXM22 , together with the results reported by us.
Method #P(M) Speedup MNLI-m(acc.) MNLI-mm(acc.) MRPC(f1/acc.) QNLI(acc.) QQP(f1/acc.) RTE(acc.) SST-2(acc.) STS-B(pear./spear.)
BERT$_{Base}^{\dagger}$ 110 1.0× 84.4 85.3 88.6/84.1 89.7 89.6/91.1 67.5 92.5 88.8/88.5
BERT6-BKD 66 2.0× 82.2 82.9 86.2/80.8 88.5 88.0/91.0 65.4 90.9 88.2/87.8
BERT6-PKD 66 2.0× 82.3 82.6 86.4/81.0 88.6 87.9/91.0 63.9 90.8 88.5/88.1
BERT6-MiniLM 66 2.0× 82.2 82.6 84.6/78.1 89.5 87.2/90.5 61.5 90.2 87.8/87.5
BERT6-RKD 66 2.0× 82.4 82.9 86.9/81.8 88.9 88.1/91.2 65.2 91.0 88.4/88.1
BERT6-FSD 66 2.0× 82.4 83.0 87.1/82.2 89.0 88.1/91.2 66.6 91.0 88.7/88.3
BERT4-BKD 55 2.9× 80.5 80.9 87.2/83.1 87.5 86.6/90.4 65.2 90.2 84.5/84.2
BERT4-PKD 55 2.9× 80.9 81.3 87.0/82.9 87.7 86.8/90.5 66.1 90.5 84.3/84.0
BERT4-MetaDistil 55 2.9× 82.4 82.7 88.4/84.2 88.6 87.8/90.8 67.8 91.8 86.3/86.0
BERT-HAT 54 2.0× 70.8 71.6 81.2/74.8 65.3 76.1/80.4 52.7 84.3 79.6/80.1
BERT-MED 54 2.0× 82.7 83.3 88.0/84.0 86.8 89.1/90.7 67.2 91.9 87.6/87.2
BERT-HAT 17.5 4.7× 63.6 64.2 68.4/78.4 61.1 69.0/79.7 47.2 82.9 74.1/75.8
BERT-MED 17.5 4.7× 81.2 82.4 86.1/82.0 86.4 83.8/86.2 64.6 88.2 86.1/86.4
BERT-HAT 6.36 5.2× 59.9 60.0 66.5/77.3 60.1 66.5/77.1 46.2 81.7 71.9/70.4
BERT-MED 6.36 5.2× 72.6 73.7 84.1/78.1 86.0 79.6/82.7 61.7 86.9 82.8/81.6

Table 6 shows the results on the development set of GLUE DBLP:conf/iclr/WangSMHLB19 . We compare MED with other KD models under similar speedup or a comparable number of parameters. The results show that MED achieves competitive performance on most tasks compared to BERT-specialized KD methods. In addition, when compared to HAT DBLP:conf/acl/WangWLCZGH20 , whose model architecture is the most similar to ours, sub-models of MED outperform HAT at all three parameter sizes. Specifically, sub-models with 54M, 17.5M, and 6.36M parameters achieve average improvements of 16.3%, 21.7%, and 19.7%, respectively.

5.6 Analysis of MED

5.6.1 Training efficiency

Table 7: Training time (hours).
TransE RotatE PairRE
WN DT 74.0 (9.49×) 141.0 (11.10×) 67.4 (10.06×)
Ext-based 1.5 (0.19×) 2.5 (0.20×) 1.6 (0.24×)
BKD 91.5 (11.73×) 163.0 (12.83×) 87.5 (13.06×)
TA 172.0 (22.05×) 272.0 (21.42×) 166.0 (24.78×)
DualDE 151.0 (19.36×) 240.0 (18.90×) 133.0 (19.85×)
MED 7.8 (1.00×) 12.7 (1.00×) 6.7 (1.00×)
FB DT 218.0 (10.23×) 381.0 (10.73×) 179.0 (9.37×)
Ext-based 4.7 (0.22×) 9.5 (0.27×) 3.7 (0.19×)
BKD 248.0 (11.64×) 443.0 (12.48×) 231.0 (12.09×)
TA - - -
DualDE - - -
MED 21.3 (1.00×) 35.5 (1.00×) 19.1 (1.00×)

We report the training time of obtaining 64 models of all sizes ($d$=10, 20, …, 640) by different methods in Table 7. For DT, the training time cost is the sum of the time of directly training 64 KGE models of all sizes in turn. For the Ext-based baselines, the training time cost is the same for all of them and equals the time of training a $d_n$-dimensional KGE model, since the time of arranging dimensions is negligible. For the KD-based baselines, the training time cost is the sum of the time of training the $d_n$-dimensional teacher model and distilling 63 student models ($d$=10, 20, …, 630) in turn. All training is performed on a single NVIDIA Tesla A100 40GB GPU for fair comparison. For TA and DualDE on FB15K237, we do not train student models of all 63 sizes, which is estimated to take more than 400 hours for each KGE method. Compared with directly training (DT) models of all sizes in turn, MED is roughly 10× faster (between 9.37× and 11.10×) across the 3 KGE methods. Although Ext-based baselines spend the shortest training time, they perform particularly poorly and lack practical value. TA and DualDE need to optimize both the student model and a larger teacher model, which greatly increases the training parameters and time cost.
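The multipliers shown in parentheses in Table 7 are the training time of each method divided by that of MED. The sketch below reproduces them for the TransE column on WN18RR, using the hours copied from the table; the function name is illustrative.

```python
# Training hours for TransE on WN18RR, copied from Table 7.
wn18rr_transe_hours = {
    "DT": 74.0,
    "Ext-based": 1.5,
    "BKD": 91.5,
    "TA": 172.0,
    "DualDE": 151.0,
    "MED": 7.8,
}


def relative_cost(hours, reference="MED"):
    """Training time of each method as a multiple of the reference method."""
    ref = hours[reference]
    return {name: round(h / ref, 2) for name, h in hours.items()}


ratios = relative_cost(wn18rr_transe_hours)
```

For this column, directly training all 64 models takes about 9.49 times as long as one MED run.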

5.6.2 Whether high-dimensional sub-models cover the capabilities of low-dimensional ones

If a high-dimensional model retains the ability of lower-dimensional models, it should correctly predict all triples that the lower-dimensional model can predict. We count the percentage of triples in the test set that meet the following condition: if the smallest sub-model that can correctly predict a given triple is $M_i$, all higher-dimensional sub-models ($M_{i+1}$, $M_{i+2}$, …, $M_n$) also correctly predict it. We denote the result as the ability retention ratio (ARR). We use Hit@10 to judge whether a triple is correctly predicted, that is, $M_i$ correctly predicts a triple if $M_i$ scores this triple in the top 10 among all candidate triples.
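The ARR definition above can be made concrete with a small sketch. It assumes the Hit@10 outcome of every sub-model on every test triple has already been computed, and it makes one interpretive choice that the text leaves open: triples that no sub-model predicts correctly are excluded from the denominator. The function name is hypothetical.

```python
def ability_retention_ratio(hits):
    """Compute ARR from per-triple Hit@10 outcomes.

    hits[t][i] is True if sub-model M_{i+1} ranks test triple t in the top 10.
    Sub-models are ordered from lowest to highest dimension.
    Assumption: triples that no sub-model predicts are skipped.
    """
    retained = 0
    counted = 0
    for row in hits:
        if not any(row):
            continue  # no sub-model predicts this triple
        counted += 1
        first = row.index(True)  # smallest sub-model that predicts it
        if all(row[first:]):  # every larger sub-model predicts it too
            retained += 1
    return retained / counted if counted else 0.0


hits = [
    [False, True, True],    # retained from M_2 onwards
    [True, False, True],    # M_2 forgets what M_1 knew
    [False, False, False],  # never predicted, skipped
    [False, False, True],   # only the largest sub-model predicts it
]
arr = ability_retention_ratio(hits)
```

In this toy example two of the three counted triples satisfy the retention condition, so the ARR is 2/3. The index `first` could also serve as a rough difficulty indicator for a triple.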

Figure 4: The ability retention ratio (ARR).

From Fig. 4, we find that ARR of MED is always much higher than baselines, especially on FB15K237, indicating that high-dimensional sub-models in MED successfully cover the power of low-dimensional ones, contributed by the mutual learning mechanism that helps high-dimensional sub-models review what low-dimensional sub-models have learned. Based on this advantage of MED, we can also provide a simple way to judge how easy or difficult a triple is for KGE methods to learn: the triple that low-dimensional sub-models can correctly predict may be easy since more high-dimensional models can also predict it, while triples that can only be predicted by a particularly high-dimensional sub-model are difficult.

5.6.3 Visual analysis of embedding

Figure 5: Clustering on FB15K237 with RotatE.

We select four primary entity categories (‘organization’, ‘sports’, ‘location’, and ‘music’) that contain more than 300 entities in FB15K237, and randomly select 250 entities for each. We cluster these entities’ embeddings of 3 different dimensions ($d$=10, 100, 600) by the t-SNE algorithm, and the clustering results are visualized in Fig. 5. Under the same dimension, the clustering result of MED is always the best, followed by DualDE, while the result of Ext-V is generally poor, which is consistent with the conclusion in Section 5.2. We also observe two notable phenomena for MED as the dimension increases: 1) the nodes of the ‘sports’ category gradually separate into two clusters, meaning that MED learns more fine-grained category information as the dimension increases; and 2) the relative distribution among different categories hardly changes and shows a trend of “inheritance” and “improvement”. This further supports that MED achieves our expectation that high-dimensional sub-models retain the ability of low-dimensional sub-models and can learn more knowledge than low-dimensional sub-models.

6 Conclusion

In this work, we propose a novel KGE training framework, MED, that trains a croppable KGE at once, so that sub-models of various required dimensions can be cropped out from it and used directly without additional training. In MED, we propose the mutual learning mechanism to improve the performance of low-dimensional sub-models and make the high-dimensional sub-models retain the ability of the low-dimensional ones, the evolutionary improvement mechanism to motivate high-dimensional sub-models to master more knowledge that low-dimensional ones cannot, and the dynamic loss weight to adaptively balance multiple losses. The experimental results show the effectiveness and high efficiency of our method: all sub-models achieve promising performance, and the performance of low-dimensional sub-models in particular is greatly improved. In future work, we will further explore the more fine-grained information encoding ability of each sub-model.

References

  • [1] Yushan Zhu, Huaixiao Zhao, Wen Zhang, Ganqiang Ye, Hui Chen, Ningyu Zhang, and Huajun Chen. Knowledge perceived multi-modal pretraining in e-commerce. In ACM Multimedia, pages 2744–2752. ACM, 2021.
  • [2] Wen Zhang, Chi Man Wong, Ganqiang Ye, Bo Wen, Wei Zhang, and Huajun Chen. Billion-scale pre-trained e-commerce product knowledge graph model. In ICDE, pages 2476–2487. IEEE, 2021.
  • [3] Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke S. Zettlemoyer, and Daniel S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In ACL, pages 541–550. The Association for Computer Linguistics, 2011.
  • [4] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In I-SEMANTICS, pages 121–124. ACM, 2013.
  • [5] Yuanzhe Zhang, Kang Liu, Shizhu He, Guoliang Ji, Zhanyi Liu, Hua Wu, and Jun Zhao. Question answering over knowledge base with neural attention combining global knowledge information. CoRR, abs/1606.00979, 2016.
  • [6] Dennis Diefenbach, Kamal Deep Singh, and Pierre Maret. Wdaqua-core1: A question answering service for RDF knowledge bases. In WWW (Companion Volume), pages 1087–1091. ACM, 2018.
  • [7] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
  • [8] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. Rotate: Knowledge graph embedding by relational rotation in complex space. In ICLR (Poster). OpenReview.net, 2019.
  • [9] Yushan Zhu, Wen Zhang, Mingyang Chen, Hui Chen, Xu Cheng, Wei Zhang, and Huajun Chen. Dualde: Dually distilling knowledge graph embedding for faster and cheaper reasoning. In WSDM, pages 1516–1524. ACM, 2022.
  • [10] Mrinmaya Sachan. Knowledge graph embedding compression. In ACL, pages 2681–2691. Association for Computational Linguistics, 2020.
  • [11] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
  • [12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), pages 4171–4186. Association for Computational Linguistics, 2019.
  • [13] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
  • [14] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, pages 1112–1119. AAAI Press, 2014.
  • [15] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187. AAAI Press, 2015.
  • [16] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge graph embedding via dynamic mapping matrix. In ACL (1), pages 687–696. The Association for Computer Linguistics, 2015.
  • [17] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, pages 809–816. Omnipress, 2011.
  • [18] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In ICLR (Poster), 2015.
  • [19] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 2071–2080. JMLR.org, 2016.
  • [20] Seyed Mehran Kazemi and David Poole. Simple embedding for link prediction in knowledge graphs. In NeurIPS, pages 4289–4300, 2018.
  • [21] Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. Quaternion knowledge graph embeddings. In NeurIPS, pages 2731–2741, 2019.
  • [22] Canran Xu and Ruijiang Li. Relation embedding with dihedral group in knowledge graph. In ACL (1), pages 263–272. Association for Computational Linguistics, 2019.
  • [23] Linlin Chao, Jianshan He, Taifeng Wang, and Wei Chu. Pairre: Knowledge graph embeddings via paired relation vectors. In ACL/IJCNLP (1), pages 4360–4369. Association for Computational Linguistics, 2021.
  • [24] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In AAAI, pages 1811–1818. AAAI Press, 2018.
  • [25] Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Q. Phung. A novel embedding model for knowledge base completion based on convolutional neural network. In NAACL-HLT (2), pages 327–333. Association for Computational Linguistics, 2018.
  • [26] Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In ESWC, volume 10843 of Lecture Notes in Computer Science, pages 593–607. Springer, 2018.
  • [27] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha P. Talukdar. Composition-based multi-relational graph convolutional networks. In ICLR. OpenReview.net, 2020.
  • [28] Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In AAAI, pages 5191–5198. AAAI Press, 2020.
  • [29] Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jin Jin, Xin Jiang, Qun Liu, Michael R. Lyu, and Irwin King. Binarybert: Pushing the limit of BERT quantization. In ACL/IJCNLP (1), pages 4334–4348. Association for Computational Linguistics, 2021.
  • [30] Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. Training with quantization noise for extreme model compression. In ICLR. OpenReview.net, 2021.
  • [31] Haoyu Wang, Yaqing Wang, Defu Lian, and Jing Gao. A lightweight knowledge graph embedding framework for efficient inference and storage. In CIKM, pages 1909–1918. ACM, 2021.
  • [32] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. In EMNLP/IJCNLP (1), pages 4322–4331. Association for Computational Linguistics, 2019.
  • [33] Kai Wang, Yu Liu, Qian Ma, and Quan Z. Sheng. Mulde: Multi-teacher knowledge distillation for low-dimensional knowledge graph embeddings. In WWW, pages 1716–1726. ACM / IW3C2, 2021.
  • [34] Zhehui Zhou, Defang Chen, Can Wang, Yan Feng, and Chun Chen. Improving knowledge graph embedding via iterative self-semantic knowledge distillation. CoRR, abs/2206.02963, 2022.
  • [35] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from BERT into simple neural networks. CoRR, abs/1903.12136, 2019.
  • [36] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
  • [37] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In EMNLP, pages 1499–1509. The Association for Computational Linguistics, 2015.
  • [38] Tara Safavi and Danai Koutra. Codex: A comprehensive knowledge graph completion benchmark. In EMNLP (1), pages 8328–8350. Association for Computational Linguistics, 2020.
  • [39] Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. YAGO3: A knowledge base from multilingual wikipedias. In CIDR. www.cidrdb.org, 2015.
  • [40] Mingyang Chen, Wen Zhang, Zhen Yao, Yushan Zhu, Yang Gao, Jeff Z. Pan, and Huajun Chen. Entity-agnostic representation learning for parameter-efficient knowledge graph embedding. In AAAI, pages 4182–4190. AAAI Press, 2023.
  • [41] Xu Han, Shulin Cao, Xin Lv, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. Openke: An open toolkit for knowledge embedding. In EMNLP (Demonstration), pages 139–144. Association for Computational Linguistics, 2018.
  • [42] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
  • [43] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR (Poster). OpenReview.net, 2017.
  • [44] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL (1), pages 5797–5808. Association for Computational Linguistics, 2019.
  • [45] Mikhail Galkin, Etienne G. Denis, Jiapeng Wu, and William L. Hamilton. Nodepiece: Compositional and parameter-efficient representations of large knowledge graphs. In ICLR. OpenReview.net, 2022.
  • [46] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Qun Liu and David Schlangen, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, pages 38–45. Association for Computational Linguistics, 2020.
  • [47] Hee-Jun Jung, Doyeon Kim, Seung-Hoon Na, and Kangil Kim. Feature structure distillation with centered kernel alignment in BERT transferring. Expert Syst. Appl., 234:120980, 2023.
  • [48] Wangchunshu Zhou, Canwen Xu, and Julian J. McAuley. BERT learns to teach: Knowledge distillation with meta learning. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 7037–7049. Association for Computational Linguistics, 2022.
  • [49] Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. HAT: hardware-aware transformers for efficient natural language processing. In ACL, pages 7675–7688. Association for Computational Linguistics, 2020.
  • [50] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In WWW, pages 173–182. ACM, 2017.
  • [51] William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, IWP@IJCNLP 2005, Jeju Island, Korea, October 2005, 2005. Asian Federation of Natural Language Processing, 2005.
  • [52] Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence representations. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Kôiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA), 2018.
  • [53] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1631–1642. ACL, 2013.
  • [54] Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics, 2018.
  • [55] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100, 000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392. The Association for Computational Linguistics, 2016.
  • [56] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976. Computer Vision Foundation / IEEE, 2019.
  • [57] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

Appendix A More Results of link prediction

Additional link prediction results are shown in Table 8 and Table 9 for WN18RR, and in Table 10 and Table 11 for FB15K237. Fig. 6 compares the sub-models of MED with the directly trained KGEs (DT) for all dimensions from 10 to 640.

Table 8: MRR and Hit@1 of some representative dimensions on WN18RR.
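| Model | Method | 10d MRR | 10d Hit@1 | 20d MRR | 20d Hit@1 | 40d MRR | 40d Hit@1 | 80d MRR | 80d Hit@1 | 160d MRR | 160d Hit@1 | 320d MRR | 320d Hit@1 | 640d MRR | 640d Hit@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TransE | DT | .121 | .011 | .176 | .016 | .214 | .018 | .227 | .025 | .233 | .027 | .235 | .033 | .237 | .034 |
| TransE | Ext | .125 | .016 | .172 | .023 | .199 | .023 | .213 | .028 | .225 | .033 | .226 | .028 | .237 | .034 |
| TransE | Ext-L | .139 | .029 | .196 | .025 | .224 | .039 | .232 | .046 | .236 | .036 | .236 | .033 | .237 | .034 |
| TransE | Ext-V | .139 | .029 | .198 | .045 | .222 | .051 | .234 | .047 | .236 | .036 | .236 | .027 | .237 | .034 |
| TransE | BKD | .141 | .035 | .207 | .040 | .226 | .033 | .232 | .031 | .233 | .030 | .236 | .032 | - | - |
| TransE | TA | .144 | .040 | .211 | .043 | .226 | .037 | .233 | .030 | .234 | .030 | .236 | .034 | - | - |
| TransE | DualDE | .148 | .037 | .213 | .043 | .225 | .037 | .234 | .031 | .235 | .031 | .238 | .034 | - | - |
| TransE | MED | .170 | .040 | .219 | .045 | .232 | .048 | .232 | .042 | .236 | .037 | .237 | .033 | .237 | .031 |
| RotatE | DT | .172 | .005 | .409 | .357 | .456 | .393 | .465 | .420 | .471 | .423 | .474 | .428 | .476 | .429 |
| RotatE | Ext | .299 | .257 | .379 | .335 | .437 | .395 | .458 | .415 | .467 | .413 | .471 | .418 | .476 | .429 |
| RotatE | Ext-L | .206 | .166 | .336 | .288 | .399 | .352 | .423 | .373 | .445 | .396 | .466 | .417 | .476 | .429 |
| RotatE | Ext-V | .261 | .197 | .304 | .234 | .337 | .263 | .366 | .293 | .416 | .357 | .451 | .397 | .476 | .429 |
| RotatE | BKD | .175 | .009 | .424 | .361 | .457 | .403 | .471 | .421 | .472 | .424 | .474 | .425 | - | - |
| RotatE | TA | .177 | .010 | .424 | .363 | .459 | .408 | .470 | .420 | .473 | .422 | .474 | .425 | - | - |
| RotatE | DualDE | .179 | .011 | .425 | .364 | .462 | .412 | .471 | .423 | .473 | .426 | .475 | .425 | - | - |
| RotatE | MED | .324 | .277 | .456 | .409 | .466 | .418 | .471 | .422 | .471 | .424 | .476 | .427 | .476 | .428 |
| PairRE | DT | .220 | .174 | .342 | .313 | .415 | .384 | .435 | .399 | .449 | .405 | .452 | .406 | .453 | .407 |
| PairRE | Ext | .152 | .120 | .261 | .198 | .334 | .267 | .375 | .314 | .419 | .364 | .438 | .388 | .453 | .407 |
| PairRE | Ext-L | .162 | .129 | .281 | .237 | .363 | .319 | .417 | .377 | .437 | .395 | .446 | .400 | .453 | .407 |
| PairRE | Ext-V | .172 | .124 | .306 | .269 | .389 | .352 | .420 | .379 | .441 | .398 | .446 | .400 | .453 | .407 |
| PairRE | BKD | .228 | .184 | .375 | .334 | .421 | .372 | .443 | .405 | .451 | .405 | .453 | .407 | - | - |
| PairRE | TA | .245 | .197 | .381 | .332 | .426 | .380 | .448 | .404 | .452 | .409 | .453 | .408 | - | - |
| PairRE | DualDE | .242 | .175 | .377 | .330 | .428 | .381 | .451 | .409 | .453 | .410 | .454 | .410 | - | - |
| PairRE | MED | .317 | .259 | .408 | .367 | .433 | .392 | .449 | .405 | .451 | .406 | .451 | .407 | .451 | .406 |
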
Table 9: Hit@10 and Hit@3 of some representative dimensions on WN18RR.
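| Model | Method | 10d Hit@10 | 10d Hit@3 | 20d Hit@10 | 20d Hit@3 | 40d Hit@10 | 40d Hit@3 | 80d Hit@10 | 80d Hit@3 | 160d Hit@10 | 160d Hit@3 | 320d Hit@10 | 320d Hit@3 | 640d Hit@10 | 640d Hit@3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TransE | DT | .287 | .202 | .453 | .291 | .496 | .385 | .524 | .401 | .531 | .403 | .534 | .407 | .537 | .412 |
| TransE | Ext | .298 | .201 | .423 | .285 | .468 | .338 | .495 | .364 | .515 | .384 | .521 | .388 | .537 | .412 |
| TransE | Ext-L | .315 | .218 | .461 | .317 | .497 | .361 | .516 | .403 | .534 | .405 | .535 | .408 | .537 | .412 |
| TransE | Ext-V | .309 | .218 | .458 | .314 | .494 | .391 | .525 | .407 | .532 | .408 | .536 | .411 | .537 | .412 |
| TransE | BKD | .323 | .216 | .480 | .331 | .513 | .392 | .527 | .401 | .531 | .404 | .533 | .407 | - | - |
| TransE | TA | .335 | .224 | .483 | .343 | .512 | .395 | .527 | .408 | .533 | .407 | .535 | .410 | - | - |
| TransE | DualDE | .337 | .226 | .488 | .346 | .514 | .394 | .530 | .408 | .533 | .408 | .535 | .411 | - | - |
| TransE | MED | .388 | .269 | .491 | .369 | .518 | .399 | .523 | .404 | .529 | .407 | .536 | .410 | .537 | .412 |
| RotatE | DT | .418 | .304 | .504 | .436 | .556 | .475 | .564 | .487 | .567 | .489 | .573 | .491 | .575 | .493 |
| RotatE | Ext | .378 | .315 | .464 | .399 | .516 | .452 | .544 | .472 | .549 | .480 | .552 | .470 | .575 | .493 |
| RotatE | Ext-L | .277 | .224 | .424 | .359 | .487 | .420 | .515 | .441 | .541 | .461 | .564 | .481 | .575 | .493 |
| RotatE | Ext-V | .377 | .289 | .433 | .336 | .471 | .377 | .497 | .402 | .532 | .442 | .561 | .467 | .575 | .493 |
| RotatE | BKD | .434 | .312 | .540 | .452 | .556 | .479 | .565 | .487 | .570 | .490 | .572 | .492 | - | - |
| RotatE | TA | .438 | .314 | .542 | .452 | .558 | .481 | .567 | .489 | .572 | .488 | .572 | .492 | - | - |
| RotatE | DualDE | .440 | .320 | .542 | .452 | .559 | .483 | .567 | .489 | .573 | .488 | .573 | .491 | - | - |
| RotatE | MED | .469 | .354 | .543 | .476 | .561 | .486 | .568 | .490 | .574 | .492 | .573 | .493 | .574 | .495 |
| PairRE | DT | .321 | .271 | .381 | .368 | .472 | .428 | .516 | .450 | .534 | .463 | .542 | .462 | .544 | .464 |
| PairRE | Ext | .209 | .163 | .379 | .292 | .463 | .366 | .493 | .398 | .526 | .437 | .545 | .452 | .544 | .464 |
| PairRE | Ext-L | .220 | .175 | .360 | .302 | .442 | .383 | .495 | .431 | .523 | .450 | .544 | .455 | .544 | .464 |
| PairRE | Ext-V | .260 | .192 | .374 | .323 | .456 | .407 | .498 | .435 | .529 | .452 | .541 | .458 | .544 | .464 |
| PairRE | BKD | .336 | .279 | .413 | .388 | .483 | .435 | .525 | .452 | .536 | .460 | .542 | .463 | - | - |
| PairRE | TA | .340 | .293 | .427 | .387 | .487 | .437 | .534 | .460 | .537 | .462 | .543 | .463 | - | - |
| PairRE | DualDE | .336 | .281 | .424 | .389 | .495 | .437 | .536 | .463 | .540 | .463 | .544 | .465 | - | - |
| PairRE | MED | .376 | .314 | .467 | .426 | .502 | .443 | .537 | .462 | .541 | .464 | .542 | .465 | .542 | .464 |
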
Table 10: MRR and Hit@1 of some representative dimensions on FB15K237.
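| Model | Method | 10d MRR | 10d Hit@1 | 20d MRR | 20d Hit@1 | 40d MRR | 40d Hit@1 | 80d MRR | 80d Hit@1 | 160d MRR | 160d Hit@1 | 320d MRR | 320d Hit@1 | 640d MRR | 640d Hit@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TransE | DT | .150 | .102 | .277 | .190 | .299 | .212 | .313 | .218 | .315 | .222 | .318 | .224 | .322 | .228 |
| TransE | Ext | .115 | .065 | .191 | .122 | .236 | .156 | .266 | .180 | .286 | .197 | .299 | .208 | .322 | .228 |
| TransE | Ext-L | .109 | .065 | .175 | .115 | .232 | .157 | .263 | .180 | .285 | .198 | .301 | .210 | .322 | .228 |
| TransE | Ext-V | .139 | .081 | .200 | .126 | .237 | .156 | .270 | .185 | .293 | .205 | .308 | .217 | .322 | .228 |
| TransE | BKD | .176 | .106 | .279 | .198 | .303 | .208 | .315 | .222 | .315 | .223 | .320 | .226 | - | - |
| TransE | TA | .175 | .112 | .281 | .200 | .303 | .212 | .314 | .220 | .319 | .225 | .321 | .223 | - | - |
| TransE | DualDE | .179 | .115 | .281 | .201 | .306 | .216 | .316 | .223 | .319 | .226 | .322 | .227 | - | - |
| TransE | MED | .196 | .122 | .290 | .199 | .308 | .218 | .317 | .223 | .320 | .226 | .321 | .227 | .322 | .227 |
| RotatE | DT | .254 | .168 | .297 | .207 | .312 | .223 | .317 | .224 | .322 | .229 | .323 | .230 | .325 | .234 |
| RotatE | Ext | .138 | .080 | .203 | .129 | .251 | .170 | .276 | .190 | .291 | .203 | .305 | .217 | .325 | .234 |
| RotatE | Ext-L | .135 | .078 | .188 | .121 | .221 | .146 | .246 | .166 | .280 | .193 | .299 | .209 | .325 | .234 |
| RotatE | Ext-V | .160 | .097 | .198 | .126 | .238 | .159 | .265 | .182 | .288 | .201 | .302 | .213 | .325 | .234 |
| RotatE | BKD | .277 | .193 | .305 | .214 | .314 | .224 | .321 | .230 | .322 | .230 | .323 | .231 | - | - |
| RotatE | TA | .280 | .196 | .306 | .216 | .313 | .225 | .319 | .229 | .323 | .229 | .323 | .231 | - | - |
| RotatE | DualDE | .282 | .197 | .307 | .216 | .315 | .227 | .318 | .230 | .322 | .232 | .324 | .233 | - | - |
| RotatE | MED | .288 | .201 | .311 | .216 | .318 | .225 | .322 | .231 | .323 | .233 | .324 | .233 | .324 | .232 |
| PairRE | DT | .182 | .116 | .243 | .162 | .284 | .202 | .307 | .222 | .319 | .227 | .328 | .235 | .332 | .237 |
| PairRE | Ext | .148 | .107 | .177 | .118 | .217 | .149 | .259 | .182 | .294 | .207 | .321 | .230 | .332 | .237 |
| PairRE | Ext-L | .150 | .099 | .196 | .134 | .219 | .159 | .271 | .188 | .309 | .219 | .326 | .233 | .332 | .237 |
| PairRE | Ext-V | .176 | .116 | .192 | .125 | .229 | .154 | .279 | .193 | .311 | .221 | .329 | .237 | .332 | .237 |
| PairRE | BKD | .198 | .132 | .251 | .168 | .288 | .203 | .311 | .224 | .321 | .233 | .330 | .236 | - | - |
| PairRE | TA | .208 | .139 | .263 | .182 | .292 | .210 | .314 | .224 | .323 | .232 | .332 | .235 | - | - |
| PairRE | DualDE | .207 | .139 | .261 | .179 | .293 | .212 | .316 | .226 | .326 | .234 | .334 | .238 | - | - |
| PairRE | MED | .239 | .172 | .274 | .189 | .303 | .213 | .314 | .224 | .324 | .232 | .329 | .236 | .330 | .235 |
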
Table 11: Hit@10 and Hit@3 of some representative dimensions on FB15K237.
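| Model | Method | 10d Hit@10 | 10d Hit@3 | 20d Hit@10 | 20d Hit@3 | 40d Hit@10 | 40d Hit@3 | 80d Hit@10 | 80d Hit@3 | 160d Hit@10 | 160d Hit@3 | 320d Hit@10 | 320d Hit@3 | 640d Hit@10 | 640d Hit@3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| TransE | DT | .235 | .169 | .440 | .301 | .477 | .327 | .484 | .340 | .499 | .348 | .501 | .353 | .508 | .358 |
| TransE | Ext | .211 | .123 | .324 | .211 | .392 | .264 | .436 | .296 | .462 | .320 | .479 | .331 | .508 | .358 |
| TransE | Ext-L | .194 | .118 | .293 | .192 | .381 | .256 | .424 | .292 | .462 | .316 | .484 | .333 | .508 | .358 |
| TransE | Ext-V | .256 | .150 | .348 | .222 | .396 | .265 | .437 | .301 | .466 | .325 | .488 | .341 | .508 | .358 |
| TransE | BKD | .293 | .178 | .446 | .308 | .480 | .336 | .500 | .349 | .501 | .349 | .502 | .354 | - | - |
| TransE | TA | .246 | .188 | .441 | .307 | .484 | .336 | .498 | .348 | .504 | .353 | .504 | .355 | - | - |
| TransE | DualDE | .301 | .193 | .443 | .307 | .483 | .337 | .502 | .351 | .505 | .354 | .508 | .356 | - | - |
| TransE | MED | .341 | .215 | .472 | .321 | .486 | .338 | .502 | .347 | .505 | .351 | .507 | .356 | .507 | .358 |
| RotatE | DT | .424 | .284 | .477 | .330 | .495 | .346 | .502 | .352 | .506 | .353 | .510 | .357 | .515 | .363 |
| RotatE | Ext | .245 | .152 | .340 | .225 | .410 | .278 | .443 | .304 | .465 | .322 | .485 | .335 | .515 | .363 |
| RotatE | Ext-L | .243 | .147 | .319 | .209 | .365 | .247 | .402 | .275 | .453 | .312 | .477 | .333 | .515 | .363 |
| RotatE | Ext-V | .281 | .174 | .340 | .218 | .393 | .264 | .427 | .293 | .458 | .319 | .478 | .336 | .515 | .363 |
| RotatE | BKD | .442 | .306 | .485 | .338 | .503 | .352 | .508 | .354 | .510 | .356 | .509 | .358 | - | - |
| RotatE | TA | .447 | .308 | .485 | .339 | .501 | .353 | .507 | .358 | .510 | .359 | .509 | .358 | - | - |
| RotatE | DualDE | .449 | .311 | .486 | .341 | .502 | .353 | .507 | .360 | .512 | .361 | .514 | .361 | - | - |
| RotatE | MED | .459 | .324 | .492 | .344 | .504 | .355 | .509 | .357 | .510 | .358 | .512 | .362 | .514 | .362 |
| PairRE | DT | .314 | .198 | .395 | .262 | .452 | .312 | .476 | .337 | .505 | .352 | .518 | .364 | .522 | .368 |
| PairRE | Ext | .222 | .158 | .289 | .187 | .353 | .236 | .416 | .283 | .469 | .325 | .506 | .354 | .522 | .368 |
| PairRE | Ext-L | .249 | .159 | .294 | .196 | .333 | .238 | .436 | .298 | .489 | .342 | .513 | .359 | .522 | .368 |
| PairRE | Ext-V | .277 | .181 | .303 | .192 | .374 | .250 | .450 | .307 | .490 | .343 | .513 | .362 | .522 | .368 |
| PairRE | BKD | .332 | .215 | .407 | .265 | .453 | .314 | .487 | .343 | .508 | .355 | .521 | .366 | - | - |
| PairRE | TA | .346 | .226 | .430 | .291 | .455 | .316 | .493 | .347 | .509 | .358 | .521 | .368 | - | - |
| PairRE | DualDE | .342 | .224 | .427 | .286 | .456 | .318 | .495 | .351 | .512 | .359 | .524 | .371 | - | - |
| PairRE | MED | .384 | .253 | .437 | .299 | .466 | .327 | .495 | .346 | .510 | .357 | .521 | .366 | .520 | .368 |
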
Figure 6: Performance of sub-models of MED and the directly trained (DT) KGEs of dimensions from 10 to 640. Panels: (a) TransE on WN18RR; (b) RotatE on WN18RR; (c) PairRE on WN18RR; (d) TransE on FB15K237; (e) RotatE on FB15K237; (f) PairRE on FB15K237.

Appendix B Ablation Study

We conduct ablation studies to evaluate the effect of the three modules in MED: the mutual learning mechanism (MLM), the evolutionary improvement mechanism (EIM), and the dynamic loss weight (DLW). Table 12 shows the MRR and Hit@$k$ ($k=1,3,10$) of MED with each of these modules removed in turn, using TransE on WN18RR.

Table 12: Ablation study on WN18RR with TransE.
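| dim | MED MRR | MED Hit@10 | MED Hit@3 | MED Hit@1 | w/o MLM MRR | w/o MLM Hit@10 | w/o MLM Hit@3 | w/o MLM Hit@1 | w/o EIM MRR | w/o EIM Hit@10 | w/o EIM Hit@3 | w/o EIM Hit@1 | w/o DLW MRR | w/o DLW Hit@10 | w/o DLW Hit@3 | w/o DLW Hit@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | .170 | .388 | .269 | .036 | .149 | .335 | .234 | .032 | .169 | .388 | .267 | .037 | .171 | .387 | .268 | .035 |
| 20 | .219 | .491 | .369 | .042 | .197 | .437 | .323 | .032 | .217 | .488 | .366 | .044 | .218 | .487 | .367 | .039 |
| 40 | .232 | .518 | .399 | .048 | .224 | .496 | .379 | .029 | .232 | .517 | .403 | .042 | .232 | .517 | .402 | .037 |
| 80 | .232 | .523 | .404 | .042 | .228 | .521 | .399 | .033 | .235 | .529 | .408 | .037 | .234 | .523 | .410 | .041 |
| 160 | .236 | .529 | .407 | .037 | .234 | .525 | .406 | .034 | .234 | .527 | .405 | .032 | .235 | .527 | .405 | .032 |
| 320 | .237 | .536 | .410 | .033 | .236 | .532 | .409 | .035 | .233 | .530 | .398 | .031 | .234 | .533 | .405 | .029 |
| 640 | .237 | .537 | .412 | .031 | .238 | .535 | .412 | .042 | .232 | .528 | .402 | .029 | .233 | .530 | .396 | .025 |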

B.1 Mutual Learning Mechanism (MLM)

We remove the mutual learning mechanism from MED and keep the other parts unchanged, where (5) is rewritten as

$$L=\sum_{i=1}^{n}\exp\left(\frac{w_{3}\cdot d_{i}}{d_{n}}\right)\cdot L_{EI}^{i}. \tag{6}$$
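As a rough illustration of Equation (6), the following Python sketch computes the per-sub-model weights $\exp(w_{3}\cdot d_{i}/d_{n})$ and the resulting weighted sum of evolutionary improvement losses. The function names, the value `w3 = 1.0`, and the loss values are placeholders chosen for illustration only; they are not settings taken from the experiments.

```python
import math

def eim_weights(dims, w3):
    """Weights exp(w3 * d_i / d_n) from Equation (6); dims is sorted ascending."""
    d_n = dims[-1]  # dimension of the largest sub-model
    return [math.exp(w3 * d_i / d_n) for d_i in dims]

def loss_without_mlm(ei_losses, dims, w3):
    """Equation (6): weighted sum of the evolutionary improvement losses."""
    weights = eim_weights(dims, w3)
    return sum(w * loss for w, loss in zip(weights, ei_losses))

dims = [10, 20, 40, 80, 160, 320, 640]
w3 = 1.0                                          # hypothetical value
ei_losses = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]   # made-up per-sub-model losses

weights = eim_weights(dims, w3)
total = loss_without_mlm(ei_losses, dims, w3)
print([round(w, 3) for w in weights], round(total, 3))
```

For a positive $w_{3}$ the weight grows monotonically with the dimension, so in this sketch the higher-dimensional sub-models contribute more to the total loss.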

From the result of “MED w/o MLM” in Table 12, we find that removing the mutual learning mechanism seriously degrades the low-dimensional sub-models, since they can no longer learn from the high-dimensional sub-models. For example, the MRR of the 10-dimensional sub-model decreases by 12.4% (from .170 to .149), and the MRR of the 20-dimensional sub-model decreases by 10% (from .219 to .197). In contrast, the degradation of the high-dimensional sub-models is much less pronounced, and the MRR of the highest-dimensional sub-model ($dim=640$) is no worse than that of MED. This is because removing the mutual learning mechanism also, to a certain degree, shields the high-dimensional sub-models from the negative influence of the low-dimensional sub-models. On the whole, this mechanism greatly improves the performance of low-dimensional sub-models.
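The relative decreases quoted above can be checked directly against the MRR columns of Table 12. The short sketch below does this arithmetic; the helper name `relative_drop` is ours.

```python
def relative_drop(full, ablated):
    """Relative decrease of a metric when going from the full model to the ablation."""
    return (full - ablated) / full

# MRR of MED vs. "MED w/o MLM" from Table 12
drop_10 = relative_drop(0.170, 0.149)   # 10-dimensional sub-model
drop_20 = relative_drop(0.219, 0.197)   # 20-dimensional sub-model

print(f"{drop_10:.1%}, {drop_20:.1%}")  # prints "12.4%, 10.0%"
```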

B.2 Evolutionary Improvement Mechanism (EIM)

In this part, we replace the evolutionary improvement loss $L_{EI}^{i}$ in Equation (5) with the regular KGE loss $L_{KGE}^{i}$:

$$L_{KGE}^{i}=\sum_{(h,r,t)\in\mathcal{T}\cup\mathcal{T}^{-}}y\log\sigma(s_{(h,r,t)}^{i})+(1-y)\log(1-\sigma(s_{(h,r,t)}^{i})). \tag{7}$$
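A minimal Python sketch of Equation (7) follows. It evaluates the printed sum over labeled triples and, as an assumption on our part, returns its negative as the quantity to minimize, which is the usual binary cross-entropy convention. The scores and labels are made-up values, and the sigmoid is not hardened against overflow for very large negative scores.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kge_log_likelihood(scores, labels):
    """The sum in Equation (7) as printed: y*log(sigma(s)) + (1-y)*log(1-sigma(s))."""
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def kge_loss(scores, labels):
    # Assumption: the minimized loss is the negative of the printed sum.
    return -kge_log_likelihood(scores, labels)

scores = [2.0, -1.5, 0.3]   # hypothetical scores s^i_(h,r,t) of one sub-model
labels = [1, 0, 0]          # y = 1 for positive triples, 0 for negative samples
loss = kge_loss(scores, labels)
print(round(loss, 4))
```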

From the result of “MED w/o EIM” in Table 12, we find that removing the evolutionary improvement mechanism mainly degrades the performance of high-dimensional sub-models. Because the mutual learning mechanism is still in place, the low-dimensional sub-models can still learn from the high-dimensional sub-models, which preserves their performance. In addition, we find that once the dimension exceeds a certain point, the performance of the sub-models stops improving and even begins to decline. We conjecture that this is because the mutual learning mechanism makes every pair of neighboring sub-models learn from each other, so some low-quality or wrong knowledge is gradually transferred from the low-dimensional sub-models to the high-dimensional sub-models, and without the evolutionary improvement mechanism, the high-dimensional sub-models can no longer correct this wrong information. The higher the dimension of the sub-model, the more error accumulates, so the performance of the high-dimensional sub-models is seriously damaged. On the whole, this mechanism mainly helps to improve the performance of high-dimensional sub-models.

B.3 Dynamic Loss Weight (DLW)

To study the effect of the dynamic loss weight, we fix the ratio of all mutual learning losses to all evolutionary improvement losses at $1:1$, and Equation (5) is rewritten as

$$L=\sum_{i=2}^{n}L_{ML}^{i-1,i}+\sum_{i=1}^{n}L_{EI}^{i}. \tag{8}$$

According to Table 12, the overall results of “MED w/o DLW” lie between those of “MED w/o MLM” and “MED w/o EIM”: the low-dimensional sub-models perform better than those of “MED w/o MLM”, and the high-dimensional sub-models perform better than those of “MED w/o EIM”. On the whole, its results are closer to “MED w/o EIM”; that is, compared with MED, the performance of the low-dimensional sub-models does not change much, while the performance of the high-dimensional sub-models decreases more noticeably. We believe that, for the high-dimensional sub-models, the proportion of the mutual learning loss is still too large, which exposes them to more negative influence from the low-dimensional sub-models. This result indicates that the dynamic loss weight adaptively balances the multiple losses and contributes to improving overall performance.

Appendix C Details of applying the trained KGE by MED to real applications

The SKG is used in many user-related tasks, and injecting user embeddings trained over the SKG into downstream task models is a common and practical way to apply it.

User labeling is one of the common user management tasks that e-commerce platforms run on backend servers. We model user labeling as a multiclass classification task for user embeddings with a 2-layer MLP:

$$\mathcal{L}=-\frac{1}{|\mathcal{U}|}\sum_{i=1}^{|\mathcal{U}|}\sum_{j=1}^{|\mathcal{CLS}|}y_{ij}\log(\mathrm{MLP}(u_{i})), \tag{9}$$

where $u_{i}$ is the $i$-th user’s embedding, and the label $y_{ij}=1$ if user $u_{i}$ belongs to class $cls_{j}$, otherwise $y_{ij}=0$.
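The loss in Equation (9) can be sketched in plain Python as follows. This is only an illustration under several assumptions of ours: the hidden layer uses ReLU, the output layer uses a softmax, $\log(\mathrm{MLP}(u_{i}))$ is read as the log-probability of class $j$, and all sizes, weights, and user vectors are made up.

```python
import math
import random

def mlp_probs(u, W1, b1, W2, b2):
    """2-layer MLP with a ReLU hidden layer and softmax output (both assumed)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, u)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    z_sum = sum(exps)
    return [e / z_sum for e in exps]

def labeling_loss(users, labels, params):
    """Equation (9): mean over users of -sum_j y_ij * log p_ij."""
    total = 0.0
    for u, y in zip(users, labels):
        probs = mlp_probs(u, *params)
        total -= sum(y_j * math.log(p_j) for y_j, p_j in zip(y, probs))
    return total / len(users)

rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

emb_dim, hidden_dim, num_classes = 4, 3, 2          # hypothetical sizes
params = (rand_matrix(hidden_dim, emb_dim), [0.0] * hidden_dim,
          rand_matrix(num_classes, hidden_dim), [0.0] * num_classes)
users = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.4, -0.1, 0.2]]   # stand-ins for user embeddings
labels = [[1, 0], [0, 1]]                                 # one-hot y_ij
loss = labeling_loss(users, labels, params)
```

In practice the user vectors would be the (possibly cropped) embeddings trained over the SKG, so only `emb_dim` changes when a different sub-model dimension is used.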

The product recommendation task is to recommend to users the items they are likely to interact with, and it often runs on terminal devices. Following PKGM DBLP:conf/icde/ZhangWYWZC21 , which recommends items to users using the neural collaborative filtering (NCF) DBLP:conf/www/HeLZNHC17 framework with the help of pre-trained user embeddings as service vectors, we add user embeddings trained over the SKG as service vectors to NCF. In NCF, the MLP layer is used to learn item-user interactions based on the latent features of the user and item; that is, for a given user-item pair $user_{i}-item_{j}$, the interaction function is

$$\phi_{1}^{MLP}\left(p_{i},q_{j}\right)=\mathrm{MLP}([p_{i};q_{j}]), \tag{10}$$

where $p_{i}$ and $q_{j}$ are the latent feature vectors of the user and item learned in NCF. We add the trained user embedding $u_{i}$ to NCF’s MLP layer and rewrite Equation (10) as

$$\phi_{1}^{MLP}\left(p_{i},q_{j},u_{i}\right)=\mathrm{MLP}([p_{i};q_{j};u_{i}]), \tag{11}$$

and the other parts of NCF stay the same as in PKGM DBLP:conf/icde/ZhangWYWZC21 .
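The difference between Equation (10) and Equation (11) is only the extra vector in the concatenation, which widens the input of the first MLP layer. The sketch below illustrates this with a single ReLU layer; the layer itself, the vector sizes, and the constant weights are illustrative assumptions and do not reproduce the actual NCF or PKGM configuration.

```python
def concat(*vectors):
    """[a; b; ...]: concatenate vectors end to end."""
    out = []
    for v in vectors:
        out.extend(v)
    return out

def mlp_layer(x, weights, biases):
    """One fully connected layer with ReLU (the activation is an assumption)."""
    for row in weights:
        if len(row) != len(x):
            raise ValueError("weight width must match input width")
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

p_i = [1.0, 0.0]          # latent user vector learned in NCF (made-up values)
q_j = [0.0, 1.0]          # latent item vector learned in NCF (made-up values)
u_i = [0.5, 0.5, 0.5]     # user embedding trained over the KG (made-up values)

x10 = concat(p_i, q_j)            # input of Equation (10)
x11 = concat(p_i, q_j, u_i)       # input of Equation (11)

hidden = 2
h10 = mlp_layer(x10, [[0.1] * len(x10) for _ in range(hidden)], [0.0] * hidden)
h11 = mlp_layer(x11, [[0.1] * len(x11) for _ in range(hidden)], [0.0] * hidden)
```

Because only the input width of the first layer depends on the size of $u_{i}$, swapping in a cropped user embedding of another dimension changes that one weight matrix and nothing else in this sketch.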

We train entity and relation embeddings for SKG based on TransE DBLP:conf/nips/BordesUGWY13 and input the trained entity (user) embedding into Equation (9) and Equation (11).

Appendix D Details of extending MED to language model BERT-base

D.1 Dataset and Evaluation Metric

For the experiments extending MED to BERT, we adopt the common GLUE DBLP:conf/iclr/WangSMHLB19 benchmark for evaluation. To be specific, we use the development set of the GLUE benchmark which includes four tasks: Paraphrase Similarity Matching, Sentiment Classification, Natural Language Inference, and Linguistic Acceptability. For Paraphrase Similarity Matching, we use MRPC DBLP:conf/acl-iwp/DolanB05 , QQP and STS-B DBLP:conf/lrec/ConneauK18 for evaluation. For Sentiment Classification, we use SST-2 DBLP:conf/emnlp/SocherPWCMNP13 . For Natural Language Inference, we use MNLI DBLP:conf/naacl/WilliamsNB18 , QNLI DBLP:conf/emnlp/RajpurkarZLL16 , and RTE for evaluation. In terms of evaluation metrics, we follow previous work DBLP:conf/naacl/DevlinCLT19 ; DBLP:conf/emnlp/SunCGL19 . For MRPC and QQP, we report F1 and accuracy. For STS-B, we consider Pearson and Spearman correlation as our metrics. The other tasks use accuracy as the metric. For MNLI, the results of MNLI-m and MNLI-mm are both reported separately.
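For reference, the metrics named above (accuracy, F1, Pearson and Spearman correlation) can be written in a few lines of plain Python. This is a generic sketch with our own function names, not the evaluation script used in the experiments; the Spearman variant assigns average ranks to ties.

```python
import math

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_binary(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```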

D.2 Baselines

For comparison, we choose Knowledge Distillation (KD) models and Hardware-Aware Transformers DBLP:conf/acl/WangWLCZGH20 (HAT) customized for transformers as baselines. For the KD models, we compare MED with Basic KD (BKD) DBLP:journals/corr/HintonVD15 , Patient KD (PKD) DBLP:conf/emnlp/SunCGL19 , Relational Knowledge Distillation (RKD) DBLP:conf/cvpr/ParkKLC19 , Deep Self-attention Distillation (MiniLM) DBLP:conf/nips/WangW0B0020 , Meta Learning-based KD (MetaDistill) DBLP:conf/acl/ZhouXM22 and Feature Structure Distillation (FSD) DBLP:journals/eswa/JungKNK23 . For the comparability of the results, we choose 4-layer BERT (BERT4) or 6-layer BERT (BERT6) as the student model architectures, which guarantees that the number of model parameters (#P(M)) or speedup is comparable. For HAT, we use the same model architecture as our MED for training and show the results of sub-models with three parameter scales.

D.3 Implementation

To implement MED on BERT, for the word embedding layer, all sub-models share the front portion of the embedding parameters in the same way as in KGE, and for the transformer layers, all sub-models share the front portion of the weight parameters as in HAT DBLP:conf/acl/WangWLCZGH20 . Specifically, assume that the embedding dimension of the largest BERT model $B_{n}$ is $d_{n}$ and the embedding dimension of the sub-model $B_{i}$ is $d_{i}$. Then, for any parameter matrix of shape $x\times y$ in $B_{n}$, its front sub-matrix of shape $\frac{d_{i}}{d_{n}}x\times\frac{d_{i}}{d_{n}}y$ is the parameter matrix at the corresponding position in $B_{i}$. Finally, we only need to replace the triple score $s_{(h,r,t)}$ in Equation (1), Equation (2), Equation (3), and Equation (4) with the logit output by the classifier for the corresponding category in the classification task.

We set $n=4$ for BERT with MED, and the 4 sub-models have the following settings: [768, 512, 256, 128] for the embedding dim, [768, 512, 256, 128] for the hidden dim, [12, 12, 6, 6] for the number of attention heads, and 12 for the number of encoder layers.
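The front-portion parameter sharing described in this section amounts to slicing the leading rows and columns of each parameter matrix of the largest model. The following sketch shows this on a small toy matrix and then computes the cropped shapes for the embedding dimensions listed above. The helper names and the toy matrix are ours, and the sketch assumes that $\frac{d_{i}}{d_{n}}x$ and $\frac{d_{i}}{d_{n}}y$ are integers.

```python
def cropped_shape(x, y, d_i, d_n):
    """Shape (d_i/d_n * x, d_i/d_n * y) of the sub-matrix used by sub-model B_i."""
    return (x * d_i // d_n, y * d_i // d_n)

def crop_front(matrix, d_i, d_n):
    """Return the front block of an x-by-y matrix (given as a list of rows)."""
    rows, cols = cropped_shape(len(matrix), len(matrix[0]), d_i, d_n)
    return [row[:cols] for row in matrix[:rows]]

# Toy 8 x 16 parameter matrix of a "largest model" with d_n = 8
W = [[r * 16 + c for c in range(16)] for r in range(8)]
W_half = crop_front(W, 4, 8)       # parameters of the sub-model with d_i = 4

# Cropped shapes of a 768 x 768 matrix for the four embedding dimensions
dims = [768, 512, 256, 128]
shapes = [cropped_shape(768, 768, d_i, dims[0]) for d_i in dims]
print(shapes)
```

Since every sub-model reads a prefix of the same storage, cropping is nested: cropping the $d_{i}=4$ block down to $d_{i}=2$ gives the same matrix as cropping the full matrix to $d_{i}=2$ directly.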