Croppable Knowledge Graph Embedding

Yushan Zhu, Zhejiang University
Wen Zhang, Zhejiang University
Zhiqiang Liu, Zhejiang University
Mingyang Chen, Zhejiang University
Lei Liang, Ant Group
Huajun Chen, Zhejiang University

Abstract

Knowledge Graph Embedding (KGE) is a common method for Knowledge Graphs (KGs) to serve various artificial intelligence tasks. The suitable dimensions of the embeddings depend on the storage and computing conditions of the specific application scenarios. Once a new dimension is required, a new KGE model needs to be trained from scratch, which greatly increases the training cost and limits the efficiency and flexibility of KGE in serving various scenarios. In this work, we propose a novel KGE training framework MED, through which we can train once to get a croppable KGE model applicable to multiple scenarios with different dimensional requirements: sub-models of the required dimensions can be cropped out of it and used directly without any additional training. In MED, we propose a mutual learning mechanism to improve the performance of the low-dimensional sub-models and make the high-dimensional sub-models retain the capacity that the low-dimensional sub-models have, an evolutionary improvement mechanism to help the high-dimensional sub-models master the knowledge that the low-dimensional sub-models cannot learn, and a dynamic loss weight to balance the multiple losses adaptively. Experiments on 3 KGE models over 4 standard KG completion datasets, 3 real application scenarios over a real-world large-scale KG, and experiments extending MED to the language model BERT show the effectiveness, high efficiency, and flexible extensibility of MED.

1 Introduction

Knowledge Graphs (KGs) are composed of triples representing facts in the form of (head entity, relation, tail entity), abbreviated as (h, r, t). KG has been widely used in recommendation systems DBLP:conf/mm/ZhuZZYCZC21 ; DBLP:conf/icde/ZhangWYWZC21 , information extraction DBLP:conf/acl/HoffmannZLZW11 ; DBLP:conf/i-semantics/DaiberJHM13 , question answering DBLP:journals/corr/ZhangLHJLW016 ; DBLP:conf/www/DiefenbachSM18 and other tasks. A common way to apply a knowledge graph is to represent the entities and relations in the knowledge graph into continuous vector spaces, called knowledge graph embedding (KGE) DBLP:conf/nips/BordesUGWY13 ; DBLP:conf/iclr/SunDNT19 , and then use the vector representation of entities and relations to serve a variety of tasks.

KGEs with higher dimensions have greater expressive power and usually achieve better performance, but this also means a larger number of parameters and requires more storage space and computing resources DBLP:conf/wsdm/ZhuZCCC0C22 ; DBLP:conf/acl/Sachan20 . The appropriate dimensions of the KGE are different for different devices or scenarios. As shown in Fig. 1, large remote servers have large storage space and sufficient computing resources to support high-dimensional KGE with good performance, while small and medium-sized terminal devices, such as vehicle-mounted systems or smartphones, can only accept low-dimensional KGE due to limited computing power and storage capacity. Therefore, according to the conditions of different devices or scenes, people tend to train the KGE with appropriate dimensions and as high quality as possible. However, the challenge is that once a new dimension is required, a new KGE needs to be trained from scratch. Especially when only low-dimensional KGE can be applied, to ensure good performance, the additional model compression technology such as knowledge distillation DBLP:journals/corr/HintonVD15 ; DBLP:conf/wsdm/ZhuZCCC0C22 is needed during training. This significantly increases training costs and limits KGE’s efficiency and flexibility in serving different scenarios.

Figure 1: Diverse KGE dimensions for a KG.

Thus, we propose a new concept, "croppable KGE", and study the following research question: is it possible to train a croppable KGE from which KGEs of various required dimensions can be cropped out, used directly without any additional training, and still achieve promising performance?

In this work, our main idea of croppable KGE learning is to train an entire KGE that contains many sub-models of different dimensions in it. These sub-models share their embedding parameters and are trained simultaneously. The goal is that the low-dimensional sub-models can benefit from the more expressive high-dimensional sub-models, while the high-dimensional sub-models retain the ability of the low-dimensional sub-models and master the knowledge that the low-dimensional sub-models cannot. Based on this idea, we propose a croppable KGE training framework MED, which consists of three main modules, the Mutual learning mechanism, the Evolutionary improvement mechanism, and the Dynamic loss weight to achieve the above purpose. Specifically, the mutual learning mechanism is based on knowledge distillation and it makes pairwise neighbor sub-models learn from each other, so that the performance of the lower-dimensional sub-model can be improved, and the higher-dimensional sub-model can retain the ability of the lower-dimensional sub-model. The evolutionary improvement mechanism helps the high-dimensional sub-model master more knowledge that the low-dimensional sub-model cannot by making the high-dimensional sub-model pay more attention to learn the triples that the low-dimensional sub-model can’t correctly predict. The dynamic loss weight is designed to adaptively balance multiple losses of different sub-models according to their dimensions and further improve the overall performance.

We evaluate the effectiveness of our proposed MED by implementing it on three typical KGE methods and four standard KG datasets. We also demonstrate its practical value by applying MED to a real-world large-scale KG and downstream tasks. Furthermore, we demonstrate the extensibility of MED by implementing it on the language model BERT DBLP:conf/naacl/DevlinCLT19 and the GLUE DBLP:conf/iclr/WangSMHLB19 benchmarks. The experimental results show that (1) MED successfully trains a croppable KGE model available for various dimensional requirements, which contains multiple parameter-shared sub-models of different dimensions that achieve high performance and can be used directly without additional training; (2) the training efficiency of MED is far higher than that of independently training multiple KGE models of different sizes or obtaining them by knowledge distillation; (3) MED can be flexibly extended to other neural network models besides KGE and achieve good performance; (4) our proposed mutual learning mechanism, evolutionary improvement mechanism, and dynamic loss weight are effective and necessary for MED to achieve overall optimal performance. In summary, our contributions are as follows:

• We propose a new research question and task: training croppable KGE, from which KGEs of different dimensions can be cropped and used directly without any additional training.
• We propose a novel framework MED, including a mutual learning mechanism, an evolutionary improvement mechanism, and a dynamic loss weight, to ensure the overall performance of all sub-models while training the croppable KGE.
• We experimentally show that all sub-models of MED work well; in particular, the low-dimensional sub-models outperform KGEs of the same dimension trained by state-of-the-art distillation-based methods. MED also shows excellent performance in real-world applications and good extensibility to other types of neural networks.

2 Related Work

This work is to achieve a croppable KGE that meets different dimensional requirements. One of the most common methods to obtain a good-performance KGE of the target dimension is utilizing knowledge distillation with a high-dimensional powerful teacher KGE. Thus, we focus on two research fields most relevant to our work: knowledge graph embedding and knowledge distillation.

2.1 Knowledge Graph Embedding

Knowledge graph embedding (KGE) technology has been widely applied with the key idea of mapping entities and relations of a KG into continuous vector spaces as vector representations, which can further serve various KG downstream tasks. TransE DBLP:conf/nips/BordesUGWY13 is the most representative translation-based KGE method, regarding the relation as a translation from the head to the tail entity. Variants of TransE include TransH DBLP:conf/aaai/WangZFC14 , TransR DBLP:conf/aaai/LinLSLZ15 , TransD DBLP:conf/acl/JiHXL015 and so on. RESCAL DBLP:conf/icml/NickelTK11 is the first one based on vector decomposition, and then to improve it, DistMult DBLP:journals/corr/YangYHGD14a , ComplEx DBLP:conf/icml/TrouillonWRGB16 , and SimplE DBLP:conf/nips/Kazemi018 are proposed. RotatE DBLP:conf/iclr/SunDNT19 is a typical rotation-based method that regards the relation as the rotation between the head and tail entities. QuatE DBLP:conf/nips/0007TYL19 and DihEdral DBLP:conf/acl/XuL19 work with a similar idea. PairRE DBLP:conf/acl/ChaoHWC20 uses two relation vectors to project the head and tail entities into a Euclidean space to encode complex relational patterns. With the development of neural networks, KGEs based on graph neural networks (GNNs) DBLP:conf/aaai/DettmersMS018 ; DBLP:conf/naacl/NguyenNNP18 ; DBLP:conf/esws/SchlichtkrullKB18 ; DBLP:conf/iclr/VashishthSNT20 are also proposed. Although KGEs are simple and effective, there is an obvious challenge: the required KGE dimension differs across scenarios, depending on the storage and computing resources of the device. A new KGE model has to be trained from scratch for each new dimension requirement, which greatly increases the training cost and limits the flexibility of KGE in serving diversified scenarios.

2.2 Knowledge Distillation

High-dimensional KGEs have strong expression ability due to the large number of parameters, but require a lot of storage and computing resources, and are not suitable for all scenarios, especially small devices. To solve this problem, a common way is to compress a high-dimensional KGE to the target low-dimensional KGE by knowledge distillation DBLP:journals/corr/HintonVD15 ; DBLP:conf/aaai/MirzadehFLLMG20 and quantization DBLP:conf/acl/BaiZHSJJLLK20 ; DBLP:conf/iclr/StockFGGGJJ21 technology.

Quantization replaces continuous vector representations with lower-dimensional discrete codes. TS-CL DBLP:conf/acl/Sachan20 is the first work of KGE compression applying quantization. LightKG DBLP:conf/cikm/WangWLG21 uses a residual module to induce diversity among codebooks. However, quantization cannot improve the inference speed so it’s still not suitable for devices with limited computing resources.

Knowledge distillation (KD) has been widely used in Computer Vision DBLP:conf/aaai/MirzadehFLLMG20 and Natural Language Processing DBLP:conf/naacl/DevlinCLT19 ; DBLP:conf/emnlp/SunCGL19 , helping reduce the model size and increase the inference speed. The core idea is to use the output of a large teacher model to guide the training of a small student model. DualDE DBLP:conf/wsdm/ZhuZCCC0C22 is a representative KD-based work to transfer the knowledge of high-dimensional KGE to low-dimensional KGE. It considers the mutual influences between the teacher and student and finetunes the teacher during training. MulDE DBLP:conf/www/Wang0MS21 transfers the knowledge from multiple low-dimensional teacher models to a student model for hyperbolic KGE. ISD DBLP:journals/corr/abs-2206-02963 improves low-dimensional KGE by making it play the teacher and student roles alternately during training. Among these methods, DualDE DBLP:conf/wsdm/ZhuZCCC0C22 is the most relevant to our work, since both use a high-dimensional teacher and a low-dimensional student. In this work, we propose a novel KD-based KGE training framework MED, with which a single training run yields a croppable KGE that meets multiple dimensional requirements.

3 Preliminary

Table 1: Score functions.

KGE method	Scoring Function $f(\mathbf{h},\mathbf{r},\mathbf{t})$
TransE DBLP:conf/nips/BordesUGWY13	$-\left\|\mathbf{h}+\mathbf{r}-\mathbf{t}\right\|$
RotatE DBLP:conf/iclr/SunDNT19	$-\left\|\mathbf{h}\circ\mathbf{r}-\mathbf{t}\right\|$
PairRE DBLP:conf/acl/ChaoHWC20	$-\left\|\mathbf{h}\circ\mathbf{r}^{H}-\mathbf{t}\circ\mathbf{r}^{T}\right\|$

Knowledge graph embedding (KGE) methods aim to express the relations between entities in a continuous vector space through a scoring function $f$ . Specifically, given a knowledge graph $\mathcal{G}=(\mathcal{E},\mathcal{R},\mathcal{T})$ where $\mathcal{E}$ , $\mathcal{R}$ and $\mathcal{T}$ are the sets of entities, relations and all observed triples, we utilize the triple scoring function to measure the plausibility of triples in the embedding space for a triple $(h,r,t)$ where $h\in\mathcal{E},r\in\mathcal{R}$ and $t\in\mathcal{E}$ . The triple score function is denoted as $s_{(h,r,t)}=f(\mathbf{h},\mathbf{r},\mathbf{t})$ with embeddings of head entity h, relation r and tail entity t as input. Table 1 summarizes the scoring functions of some popular KGE methods, where $\circ$ is the Hadamard product. The higher the triple score, the more likely the model is to judge the triples as true.
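As an illustration of the scoring functions in Table 1, the following minimal sketch evaluates the RotatE and PairRE scores on toy vectors. It rests on several assumptions that are not stated in the text: the norm is taken to be the Euclidean norm, RotatE relations are parameterized by phase angles so that each relation element has unit modulus, and the function and variable names are hypothetical.

```python
import cmath
import math

def l2_norm(vec):
    """Euclidean norm of a real or complex vector (the choice of norm is an assumption)."""
    return math.sqrt(sum(abs(x) ** 2 for x in vec))

def rotate_score(h, phases, t):
    """RotatE: -||h o r - t||, with r_k = exp(i * phase_k) so that |r_k| = 1."""
    r = [cmath.exp(1j * p) for p in phases]
    return -l2_norm([hk * rk - tk for hk, rk, tk in zip(h, r, t)])

def pairre_score(h, r_head, r_tail, t):
    """PairRE: -||h o r^H - t o r^T||, where o is the Hadamard product."""
    return -l2_norm([hk * a - tk * b for hk, a, tk, b in zip(h, r_head, t, r_tail)])

# A 90-degree rotation maps h = (1, i) onto t = (i, -1), so this triple gets the
# highest possible score (approximately 0); with no rotation the score is lower.
h = [1 + 0j, 0 + 1j]
t = [0 + 1j, -1 + 0j]
perfect = rotate_score(h, [math.pi / 2, math.pi / 2], t)
worse = rotate_score(h, [0.0, 0.0], t)
```

A higher (less negative) score marks a more plausible triple, which is why `perfect` exceeds `worse` here.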

4 MED Framework

As shown in Fig. 2, our croppable KGE framework MED contains multiple (let’s say $n$) sub-models of different dimensions, denoted as $M_{i}$ $(i=1,2,\ldots,n)$ with dimension $d_{i}$. Each sub-model $M_{i}$ is composed of the first $d_{i}$ dimensions of the whole embedding, and the score of triple $(h,r,t)$ output by $M_{i}$ is $s_{(h,r,t)}^{i}=f(\mathbf{h}[0:d_{i}],\mathbf{r}[0:d_{i}],\mathbf{t}[0:d_{i}])$, where $\mathbf{h}[0:d_{i}]$ represents the first $d_{i}$ elements of vector $\mathbf{h}$. The parameters of sub-model $M_{i}$ are shared by all sub-models $M_{j}$ $(i<j\leqslant n)$ that are higher-dimensional than it. The number of sub-models $n$ and the specific dimension of each sub-model $d_{i}$ can be set according to the actual application needs. For low-dimensional sub-models, we want to improve their performance as much as possible. For high-dimensional sub-models, we hope they cover the abilities that low-dimensional sub-models already have and master the knowledge that low-dimensional sub-models cannot learn well; that is, they need to correctly predict not only the triples that low-dimensional sub-models predict correctly but also those that low-dimensional sub-models predict wrongly.
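The cropping operation can be illustrated with a small sketch. It assumes the TransE score with an L1 norm (the norm is not specified in the text) and uses hypothetical helper names; each sub-model simply scores the triple on a prefix of the shared embeddings.

```python
def transe_score(h, r, t):
    """TransE score -||h + r - t||, here with the L1 norm (an assumption)."""
    return -sum(abs(hk + rk - tk) for hk, rk, tk in zip(h, r, t))

def crop(vec, d):
    """Return the first d elements of an embedding, i.e. vec[0:d]."""
    return vec[:d]

def submodel_scores(h, r, t, dims):
    """Score one triple with every sub-model M_i, where M_i uses the first d_i dimensions."""
    return [transe_score(crop(h, d), crop(r, d), crop(t, d)) for d in dims]

# Toy 4-dimensional embeddings shared by two sub-models of dimension 2 and 4.
h = [0.5, -0.2, 0.1, 0.3]
r = [0.1, 0.4, -0.1, 0.2]
t = [0.6, 0.1, 0.3, 0.1]
scores = submodel_scores(h, r, t, dims=[2, 4])  # approximately [-0.1, -0.8]
```

Because the sub-models are prefixes of one parameter vector, no extra parameters are stored for the lower-dimensional ones.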

MED is based on knowledge distillation DBLP:journals/corr/HintonVD15 ; DBLP:journals/corr/abs-1903-12136 ; DBLP:conf/naacl/DevlinCLT19 technique that the student learns by fitting the hard (ground-truth) label and the soft label from the teacher simultaneously. In MED, we first propose a mutual learning mechanism that makes low-dimensional sub-models learn from high-dimensional sub-models to achieve better performance, and makes high-dimensional sub-models also learn from low-dimensional sub-models to retain the abilities that low-dimensional sub-models already have. Then, we propose an evolutionary improvement mechanism to enable high-dimensional sub-models to master the knowledge that the low-dimensional sub-models can not learn well. Finally, we train MED with dynamic loss weight to adaptively balance multiple optimization objectives of sub-models.

4.1 Mutual Learning Mechanism

We treat each sub-model $M_{i}$ as the student of its higher-dimensional neighbor sub-model $M_{i+1}$ to achieve better performance, since high-dimensional KGEs usually have more expressive power than low-dimensional ones due to more parameters DBLP:conf/acl/Sachan20 ; DBLP:conf/wsdm/ZhuZCCC0C22 . We also treat sub-model $M_{i}$ as the student of its lower-dimensional neighbor sub-model $M_{i-1}$ , so the higher-dimensional sub-model can review what the lower-dimensional sub-model has learned and retain the low-dimensional one’s existing abilities. Thus, pairwise neighbor sub-models serve as both teachers and students, learning from each other. The mutual learning loss between each pair of neighbor sub-models is

$$L_{ML}^{i-1,i}=\sum_{(h,r,t)\in\mathcal{T}\cup\mathcal{T}^{-}}d_{\delta}\left(s_{(h,r,t)}^{i-1},s_{(h,r,t)}^{i}\right),\quad 1<i\leqslant n, \tag{1}$$

where $s_{(h,r,t)}^{i}$ is the score of triple $(h,r,t)$ output by sub-model $M_{i}$ and reflects the plausibility of this triple, $\mathcal{T}^{-}=\mathcal{E}\times\mathcal{R}\times\mathcal{E}\setminus\mathcal{T}$ is the negative triple set, $n$ is the number of sub-models, and $d_{\delta}$ is the Huber loss Huber1964Robust with $\delta=1$, commonly used in knowledge distillation for KGE DBLP:conf/wsdm/ZhuZCCC0C22 . MED makes each sub-model learn only from its neighbor sub-models. This not only reduces the computational complexity of training but also keeps the dimension gap within every teacher-student pair relatively small, which is important because a large dimension gap between teacher and student degrades the distillation effect DBLP:conf/aaai/MirzadehFLLMG20 ; DBLP:conf/wsdm/ZhuZCCC0C22 .
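To make Eq. (1) concrete, the sketch below computes the Huber distance between the scores of neighbor sub-models and sums it over all neighbor pairs. It is a forward-value illustration only: the sum runs over a small batch rather than the full triple set (an assumption), gradient handling is omitted, and the function names are hypothetical.

```python
def huber(a, b, delta=1.0):
    """Huber distance between two scalars: quadratic near zero, linear beyond delta."""
    diff = abs(a - b)
    if diff <= delta:
        return 0.5 * diff * diff
    return delta * (diff - 0.5 * delta)

def mutual_learning_loss(scores_low, scores_high, delta=1.0):
    """L_ML^{i-1,i}: Huber distance summed over the triples of a batch."""
    return sum(huber(a, b, delta) for a, b in zip(scores_low, scores_high))

def total_mutual_learning_loss(scores_per_submodel, delta=1.0):
    """Sum of L_ML^{i-1,i} over all neighbor pairs (i = 2..n)."""
    pairs = zip(scores_per_submodel, scores_per_submodel[1:])
    return sum(mutual_learning_loss(lo, hi, delta) for lo, hi in pairs)

# Scores of two triples under three sub-models M_1, M_2, M_3.
scores = [
    [-2.0, -0.5],  # M_1
    [-1.5, -0.5],  # M_2
    [-1.5, -3.5],  # M_3
]
loss = total_mutual_learning_loss(scores)  # 0.125 + 2.5 = 2.625
```

Only `n - 1` pairwise terms are evaluated, rather than one term for every pair of sub-models, which reflects the reduced training cost mentioned above.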

4.2 Evolutionary Improvement Mechanism

The hard (ground-truth) label is the other important supervision signal during training in knowledge distillation DBLP:journals/corr/HintonVD15 . High-dimensional sub-models need to master triples that low-dimensional sub-models can not learn well, that is, high-dimensional sub-models need to correctly predict those positive (negative) triples that are wrongly predicted to be negative (positive) by low-dimensional sub-models. In MED, for a given triple $(h,r,t)$ , the optimization weight in sub-model $M_{i}$ for it depends on the triple score output by the previous sub-model $M_{i-1}$ .

For a positive triple, the optimization weight of the model $M_{i}$ for it is negatively correlated with its score by the model $M_{i-1}$ . Specifically, the higher its score from the model $M_{i-1}$ (meaning that $M_{i-1}$ has been able to correctly judge it as a positive sample), the lower the optimization weight of the model $M_{i}$ for it, and the lower its score from the model $M_{i-1}$ (meaning that $M_{i-1}$ wrongly judges it as a negative sample), the higher the optimization weight of the model $M_{i}$ for it because $M_{i-1}$ cannot predict this triple well. The optimization weight of $M_{i}$ for the positive triple is

$$pos_{h,r,t}^{i}=\begin{cases}\dfrac{\exp\left(w_{1}/s_{(h,r,t)}^{i-1}\right)}{\sum_{(h,r,t)\in T_{batch}}\exp\left(w_{1}/s_{(h,r,t)}^{i-1}\right)} & \text{if } 1<i\leqslant n,\\[2ex] \dfrac{1}{|T_{batch}|} & \text{if } i=1,\end{cases} \tag{2}$$

where $s_{(h,r,t)}^{i-1}$ is the score for triple $(h,r,t)$ output by the sub-model $M_{i-1}$ , $T_{batch}$ is the set of positive triples within a batch, and $w_{1}$ is a learnable scaling parameter. Conversely, for a negative triple, the optimization weight of the model $M_{i}$ for it is positively correlated with its score by the model $M_{i-1}$ . The optimization weight of $M_{i}$ for the negative triple is

$$neg_{h,r,t}^{i}=\begin{cases}\dfrac{\exp\left(w_{2}\cdot s_{(h,r,t)}^{i-1}\right)}{\sum_{(h,r,t)\in T_{batch}^{-}}\exp\left(w_{2}\cdot s_{(h,r,t)}^{i-1}\right)} & \text{if } 1<i\leqslant n,\\[2ex] \dfrac{1}{|T_{batch}^{-}|} & \text{if } i=1,\end{cases} \tag{3}$$

where $T_{batch}^{-}$ is the set of negative triples within a batch, and $w_{2}$ is a learnable scaling parameter.
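The weights in Eqs. (2) and (3) are softmax normalizations over a batch, which the following sketch illustrates. It assumes that scores are strictly negative, as for the distance-based scoring functions in Table 1, so that $w_{1}/s$ is well defined; the fixed values of `w1` and `w2` stand in for the learnable parameters, and all function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def positive_weights(prev_scores, w1=1.0):
    """Eq. (2) for i > 1: softmax of w1 / s^{i-1} over the positive triples of a batch."""
    return softmax([w1 / s for s in prev_scores])

def negative_weights(prev_scores, w2=1.0):
    """Eq. (3) for i > 1: softmax of w2 * s^{i-1} over the negative triples of a batch."""
    return softmax([w2 * s for s in prev_scores])

def uniform_weights(batch_size):
    """Eqs. (2) and (3) for i = 1: every triple gets the same weight."""
    return [1.0 / batch_size] * batch_size

# Scores from M_{i-1}; a score closer to 0 means "judged more plausible".
pos = positive_weights([-0.5, -4.0])  # the second positive is poorly predicted
neg = negative_weights([-0.5, -4.0])  # the first negative is wrongly judged plausible
```

In both cases the larger weight falls on the triple that $M_{i-1}$ gets wrong: the low-scored positive and the high-scored negative.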

Therefore, the evolutionary improvement loss of the sub-model $M_{i}$ is

$$L_{EI}^{i}=-\sum_{(h,r,t)\in\mathcal{T}\cup\mathcal{T}^{-}}\left[pos_{h,r,t}^{i}\cdot y\log\sigma\left(s_{(h,r,t)}^{i}\right)+neg_{h,r,t}^{i}\cdot(1-y)\log\left(1-\sigma\left(s_{(h,r,t)}^{i}\right)\right)\right], \tag{4}$$

where $\sigma$ is the Sigmoid activation function, $y$ is the ground-truth label of the triple $(h,r,t)$ , and it is $1$ for positive triples and $0$ for negative ones. In each sub-model, different hard (ground-truth) label loss weights are set for different triples, and the high-dimensional sub-model will pay more attention to learn the triple that the low-dimensional sub-model can not learn well.
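Eq. (4) is a weighted binary cross-entropy, sketched below for one sub-model on one batch. The sketch assumes the per-triple weights have already been computed, splits the batch into positive and negative lists instead of carrying the label $y$ explicitly, and uses hypothetical function names.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def evolutionary_improvement_loss(pos_scores, pos_weights, neg_scores, neg_weights):
    """Eq. (4) for one sub-model M_i on one batch.

    Positive triples (y = 1) contribute pos * log(sigmoid(s)).
    Negative triples (y = 0) contribute neg * log(1 - sigmoid(s)),
    which equals neg * log(sigmoid(-s)).
    """
    pos_term = sum(w * log_sigmoid(s) for s, w in zip(pos_scores, pos_weights))
    neg_term = sum(w * log_sigmoid(-s) for s, w in zip(neg_scores, neg_weights))
    return -(pos_term + neg_term)

# One positive and one negative triple, both scored 0 with unit weight:
# each contributes -log(0.5), so the loss is 2 * ln(2).
loss = evolutionary_improvement_loss(
    pos_scores=[0.0], pos_weights=[1.0],
    neg_scores=[0.0], neg_weights=[1.0],
)
```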

4.3 Dynamic Loss Weight

Since MED involves the optimization of multiple sub-models, we set dynamic loss weights during training. Initially, low-dimensional sub-models prioritize learning from high-dimensional sub-models to improve performance. This means low-dimensional sub-models rely more on soft label information, so for low-dimensional sub-models, evolutionary improvement loss should account for less than mutual learning loss. Conversely, high-dimensional sub-models should focus more on capturing knowledge that low-dimensional models lack, while mitigating the impact of low-quality outputs from low-dimensional models to maintain their good performance, that is, high-dimensional sub-models rely more on hard label information. So for high-dimensional sub-models, evolutionary improvement loss should account for more than mutual learning loss. For a teacher-student pair, their mutual learning loss acts on both teacher and student models simultaneously, so the effect of mutual learning loss for them is theoretically the same. We set different evolutionary improvement loss weights for different sub-models, and the final training loss function of MED is

$$L=\sum_{i=2}^{n}L_{ML}^{i-1,i}+\sum_{i=1}^{n}\exp\left(\frac{w_{3}\cdot d_{i}}{d_{n}}\right)\cdot L_{EI}^{i}, \tag{5}$$

where $w_{3}$ is a learnable scaling parameter, and $d_{i}$ is the dimension of the $i$-th sub-model.
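The effect of the dimension-dependent weight in Eq. (5) can be seen in a small numeric sketch. It treats the per-sub-model losses as already computed numbers and fixes `w3` at its initial value of 1 instead of learning it; the function names are hypothetical.

```python
import math

def ei_loss_weight(d_i, d_n, w3=1.0):
    """Weight exp(w3 * d_i / d_n) applied to L_EI^i in Eq. (5)."""
    return math.exp(w3 * d_i / d_n)

def total_loss(ml_losses, ei_losses, dims, w3=1.0):
    """Eq. (5): sum of neighbor mutual learning losses plus weighted EI losses.

    ml_losses has n - 1 entries (one per neighbor pair); ei_losses and dims have n.
    """
    if len(ei_losses) != len(dims) or len(ml_losses) != len(dims) - 1:
        raise ValueError("expected n EI losses, n dims and n - 1 ML losses")
    d_n = dims[-1]
    weighted_ei = sum(ei_loss_weight(d, d_n, w3) * l for d, l in zip(dims, ei_losses))
    return sum(ml_losses) + weighted_ei

dims = [10, 100, 500]
weights = [ei_loss_weight(d, dims[-1]) for d in dims]
loss = total_loss(ml_losses=[0.2, 0.3], ei_losses=[1.0, 1.0, 1.0], dims=dims)
```

With a positive `w3`, the weight grows with the dimension (about 1.02, 1.22 and 2.72 here), so the hard-label loss matters more for higher-dimensional sub-models.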

5 Experiment

We evaluate MED on typical KGE and GLUE benchmarks and particularly answer the following research questions: (RQ1) Can MED train, in a single run, a croppable KGE from which multiple sub-models of different dimensions can be cropped, all achieving promising performance? (RQ2) Can MED finally achieve parameter-efficient KGE models? (RQ3) Does MED work in real-world applications? (RQ4) Can MED be extended to other neural networks besides KGE?

5.1 Experiment Setting

5.1.1 Dataset and KGE methods

MED is universal and can be applied to any KGE method with a triple score function. We select three commonly used KGE methods as examples: TransE DBLP:conf/nips/BordesUGWY13 , RotatE DBLP:conf/iclr/SunDNT19 and PairRE DBLP:conf/acl/ChaoHWC20 ; their triple score functions are described in Table 1.

Table 2: Statistics of datasets.

Dataset	#Ent.	#Rel.	#Train	#Valid	#Test
WN18RR	40,943	11	86,835	3,034	3,134
FB15K237	14,541	237	272,115	17,535	20,466
CoDEx-L	77,951	69	551,193	30,622	30,622
YAGO3-10	123,143	37	1,079,040	4,978	4,982
SKG	6,974,959	15	50,775,620	-	-

We conduct comparison experiments on two common KG completion benchmark datasets, WN18RR DBLP:conf/emnlp/ToutanovaCPPCG15 and FB15K237 DBLP:conf/aaai/DettmersMS018 , and two larger-scale KGs, CoDEx-L DBLP:conf/emnlp/SafaviK20 and YAGO3-10 DBLP:conf/cidr/MahdisoltaniBS15 . Besides, we apply MED to a real-world large-scale e-commerce social knowledge graph (SKG), which involves more than 50 million triples of social records from about 7 million users on the Taobao platform, in real application scenarios. Table 2 shows the statistics of the datasets.

5.1.2 Evaluation Metric

For the link prediction task, we adopt the standard metrics MRR and Hit@$k$ ($k=1,3,10$) in the filtered setting DBLP:conf/nips/BordesUGWY13 . We use Effi DBLP:conf/aaai/ChenZYZGPC23 , that is MRR/#P (#P is the number of parameters), to quantify the parameter efficiency of models. We use F1-score and accuracy for the user labeling task, and normalized discounted cumulative gain ndcg@$k$ ($k=5,10$) for the product recommendation task.
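As a rough illustration of these metrics, the sketch below computes a filtered rank, MRR, Hit@$k$ and Effi from toy inputs. The helper names are hypothetical, ties are not handled, and Effi is computed with #P expressed in millions, which matches the values reported in Table 4 (for example, 0.318 / 1.2 ≈ 0.265).

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates, ignoring other candidates known to be true."""
    target_score = scores[target]
    better = sum(
        1
        for cand, s in scores.items()
        if cand != target and cand not in known_true and s > target_score
    )
    return better + 1

def mrr(ranks):
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def effi(mrr_value, params_in_millions):
    """Parameter efficiency MRR / #P, with #P in millions as in Table 4."""
    return mrr_value / params_in_millions

# Candidate "a" outscores the target "c" but is itself a known true answer,
# so the filtered setting does not count it against "c".
rank = filtered_rank({"a": -0.1, "b": -0.2, "c": -0.3}, "c", known_true={"a"})

ranks = [1, 2, 4, 20]
summary = (mrr(ranks), hits_at_k(ranks, 1), hits_at_k(ranks, 3), hits_at_k(ranks, 10))
```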

5.1.3 Implementation

For the link prediction task, we set $d_{n}=640$ for the highest-dimensional sub-model $M_{n}$ and $d_{1}=10$ for the lowest-dimensional sub-model $M_{1}$. We set $n=64$ and the same dimension gap of $10$ for every pair of neighbor sub-models, so that there are a total of $64$ available sub-models of different dimensions from 10 to 640 in our croppable KGE model. The dimension of sub-model $M_{i}$ $(i=1,2,\ldots,64)$ is $10\times i$. For the user labeling and product recommendation tasks, we set $n=3$ and train the croppable KGE containing 3 sub-models: $M_{1}$ with $d_{1}=10$ for mobile phone (MP) terminals that are limited by storage and computing resources, $M_{2}$ with $d_{2}=100$ for the personal computer (PC), and $M_{3}$ with $d_{3}=500$ for the platform’s servers. We initialize the learnable scaling parameters $w_{1}$, $w_{2}$ and $w_{3}$ in (2), (3) and (5) to 1. We implement MED by extending OpenKE DBLP:conf/emnlp/HanCLLLSL18 , an open-source KGE framework based on PyTorch. We set the batch size to $1024$ and the maximum number of training epochs to $3000$ with early stopping. For each positive triple, we generate $64$ negative triples by randomly replacing its head or tail entity with another entity. We use the Adam DBLP:journals/corr/KingmaB14 optimizer with a linear decay learning rate scheduler and search the initial learning rate in $\{0.0001,0.0005,0.001,0.01\}$. We train all sub-models simultaneously by optimizing the uniformly sampled sub-models from the full croppable model in each step.
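Two pieces of this setup lend themselves to a short sketch: the list of sub-model dimensions and the negative sampling. This is an illustration, not the OpenKE-based implementation; the function names are hypothetical, head and tail are corrupted with equal probability (an assumption), and no check is made that a corrupted triple is absent from the KG.

```python
import random

def submodel_dims(n=64, gap=10):
    """Dimensions d_i = gap * i for i = 1..n (10, 20, ..., 640 by default)."""
    return [gap * i for i in range(1, n + 1)]

def sample_negatives(triple, num_entities, num_negatives=64, rng=random):
    """Corrupt the head or the tail of `triple` with a randomly drawn entity id."""
    h, r, t = triple
    negatives = []
    while len(negatives) < num_negatives:
        candidate = rng.randrange(num_entities)
        if rng.random() < 0.5:
            corrupted = (candidate, r, t)  # replace the head
        else:
            corrupted = (h, r, candidate)  # replace the tail
        if corrupted != triple:            # skip draws that reproduce the positive
            negatives.append(corrupted)
    return negatives

dims = submodel_dims()
triple = (3, 0, 7)
negatives = sample_negatives(triple, num_entities=100, rng=random.Random(0))
```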

5.1.4 Baselines

For each required dimension $d_{r}$ , we extract the first $d_{r}$ dimensions from our croppable KGE as the target model and compare it to the KGE models obtained by 7 baselines of the following 3 types:

• Directly training the target KGE model of the required dimension $d_{r}$, referred to as 1) DT. The directly trained highest-dimensional KGE model ($d_{r}=d_{n}$) is marked as $M_{max}^{DT}$.
• Extracting the first $d_{r}$ dimensions from $M_{max}^{DT}$ as the target model, referred to as 2) Ext. Besides, following DBLP:conf/iclr/MolchanovTKAK17 ; DBLP:conf/acl/VoitaTMST19 , we update $M_{max}^{DT}$ by assessing the importance of each of its 640 dimensions and arranging them in descending order of importance before extracting: 3) Ext-L, where the importance of each dimension of $M_{max}^{DT}$ is the variation of the KGE loss on the validation set after removing it; and 4) Ext-V, where the importance of each dimension is the average absolute value of its parameter weights over all entities and all relations.
• Distilling the target KGE by KD methods: 5) BKD DBLP:journals/corr/HintonVD15 is the most basic one, minimizing the KL divergence between the output distributions of the teacher and student; 6) TA DBLP:conf/aaai/MirzadehFLLMG20 uses a medium-size teaching assistant (TA) model to bridge the size gap, where the TA model has the same dimension as the directly trained model whose MRR is closest to the average MRR of the teacher and student; and 7) DualDE DBLP:conf/wsdm/ZhuZCCC0C22 compresses KGE by optimizing the teacher and student simultaneously. We do not compare with MulDE DBLP:conf/www/Wang0MS21 , which uses multiple different low-dimensional KGE models as teachers to aggregate their knowledge into one model rather than compressing a high-dimensional KGE. In these baselines, $M_{max}^{DT}$ is the teacher, and other settings including hyperparameters are the same as in their original papers.
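The Ext-V baseline described above can be sketched as follows. The sketch assumes that entity and relation embeddings share one dimensionality and are pooled into a single list, which need not hold for every KGE method; the function names are hypothetical.

```python
def ext_v_order(embeddings):
    """Order dimensions by mean absolute parameter value, most important first."""
    dim = len(embeddings[0])
    importance = [
        sum(abs(vec[k]) for vec in embeddings) / len(embeddings) for k in range(dim)
    ]
    return sorted(range(dim), key=lambda k: importance[k], reverse=True)

def extract(embeddings, order, d_r):
    """Keep the d_r most important dimensions of every embedding."""
    keep = order[:d_r]
    return [[vec[k] for k in keep] for vec in embeddings]

# Two toy 3-dimensional embeddings; dimension 1 has the largest mean |value|.
embeddings = [
    [0.1, -0.9, 0.3],
    [0.2, 0.8, -0.4],
]
order = ext_v_order(embeddings)          # [1, 2, 0]
cropped = extract(embeddings, order, 2)  # [[-0.9, 0.3], [0.8, -0.4]]
```

Ext-L would use the same `extract` step, but with importance measured by the change in validation loss when a dimension is removed.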

5.2 Performance Comparison

We report the link prediction results of some representative dimensions in Table 3, more results of other dimensions and metrics are in Appendix A and the ablation studies are in Appendix B.

Table 3: MRR and Hit@10 (H10) of some dimensions on WN18RR (WN) and FB15K237 (FB).

		WN18RR								FB15K237
		10d		40d		160d		640d		10d		40d		160d		640d
KGE	Method	MRR	H10	MRR	H10	MRR	H10	MRR	H10	MRR	H10	MRR	H10	MRR	H10	MRR	H10
TransE	DT	0.121	0.287	0.214	0.496	0.233	0.531	0.237	0.537	0.150	0.235	0.299	0.477	0.315	0.499	0.322	0.508
	Ext	0.125	0.298	0.199	0.468	0.225	0.515	0.237	0.537	0.115	0.211	0.236	0.392	0.286	0.462	0.322	0.508
	Ext-L	0.139	0.315	0.224	0.497	0.236	0.534	0.237	0.537	0.109	0.194	0.232	0.381	0.285	0.462	0.322	0.508
	Ext-V	0.139	0.309	0.222	0.494	0.236	0.532	0.237	0.537	0.139	0.256	0.237	0.396	0.293	0.466	0.322	0.508
	BKD	0.141	0.323	0.226	0.513	0.233	0.531	-	-	0.176	0.293	0.303	0.480	0.315	0.501	-	-
	TA	0.144	0.335	0.226	0.512	0.234	0.533	-	-	0.175	0.246	0.303	0.484	0.319	0.504	-	-
	DualDE	0.148	0.337	0.225	0.514	0.235	0.533	-	-	0.179	0.301	0.306	0.483	0.319	0.505	-	-
	MED	0.170	0.388	0.232	0.518	0.236	0.529	0.237	0.537	0.196	0.341	0.308	0.486	0.320	0.505	0.322	0.507
RotatE	DT	0.172	0.418	0.456	0.556	0.471	0.567	0.476	0.575	0.254	0.424	0.312	0.495	0.322	0.506	0.325	0.515
	Ext	0.299	0.378	0.437	0.516	0.467	0.549	0.476	0.575	0.138	0.245	0.251	0.410	0.291	0.465	0.325	0.515
	Ext-L	0.206	0.277	0.399	0.487	0.445	0.541	0.476	0.575	0.135	0.243	0.221	0.365	0.280	0.453	0.325	0.515
	Ext-V	0.261	0.377	0.337	0.471	0.416	0.532	0.476	0.575	0.160	0.281	0.238	0.393	0.288	0.458	0.325	0.515
	BKD	0.175	0.434	0.457	0.556	0.472	0.570	-	-	0.277	0.442	0.314	0.503	0.322	0.510	-	-
	TA	0.177	0.438	0.459	0.558	0.473	0.572	-	-	0.280	0.447	0.313	0.501	0.323	0.510	-	-
	DualDE	0.179	0.440	0.462	0.559	0.473	0.573	-	-	0.282	0.449	0.315	0.502	0.322	0.512	-	-
	MED	0.324	0.469	0.466	0.561	0.471	0.574	0.476	0.574	0.288	0.459	0.318	0.504	0.323	0.510	0.324	0.514
PairRE	DT	0.220	0.321	0.415	0.472	0.449	0.534	0.453	0.544	0.182	0.314	0.284	0.452	0.319	0.505	0.332	0.522
	Ext	0.152	0.209	0.334	0.463	0.419	0.526	0.453	0.544	0.148	0.222	0.217	0.353	0.294	0.469	0.332	0.522
	Ext-L	0.162	0.220	0.363	0.442	0.437	0.523	0.453	0.544	0.150	0.249	0.219	0.333	0.309	0.489	0.332	0.522
	Ext-V	0.172	0.260	0.389	0.456	0.441	0.529	0.453	0.544	0.176	0.277	0.229	0.374	0.311	0.490	0.332	0.522
	BKD	0.228	0.336	0.421	0.483	0.451	0.536	-	-	0.198	0.332	0.288	0.453	0.321	0.508	-	-
	TA	0.245	0.340	0.426	0.487	0.452	0.537	-	-	0.208	0.346	0.292	0.455	0.323	0.509	-	-
	DualDE	0.242	0.336	0.428	0.495	0.453	0.540	-	-	0.207	0.342	0.293	0.456	0.326	0.512	-	-
	MED	0.317	0.376	0.433	0.502	0.451	0.541	0.451	0.542	0.239	0.384	0.303	0.466	0.324	0.510	0.330	0.520

MED outperforms baselines in almost all settings, especially for the extremely low dimensions. On WN18RR with $d=10$, MED achieves an improvement of 14.9% and 15.1% on TransE, 8.4% and 6.6% on RotatE, and 29.4% and 10.6% on PairRE compared with the best MRR and Hit@10 of the baselines. We can observe a similar phenomenon on FB15K237. This benefits from the rich knowledge sources of low-dimensional models in MED: for sub-model $M_{i}$, $M_{i+1}$ is the teacher directly next to it, while $M_{i+2}$ can also indirectly affect $M_{i}$ by directly affecting $M_{i+1}$. Theoretically, all higher-dimensional sub-models can finally transfer their knowledge to low-dimensional sub-models through stepwise propagation. Although such stepwise propagation may have negative effects on high-dimensional models by bringing low-quality knowledge from low-dimensional sub-models, the evolutionary improvement mechanism in MED weakens the damage and lets the high-dimensional ones still achieve performance competitive with directly trained KGEs, as in Fig. 3. We also find that Ext-based methods are extremely unstable: Ext, Ext-L, and Ext-V work worse than DT except on WN18RR with TransE, indicating that only considering the importance of each dimension is not enough to guarantee the performance of all sub-models. More results and ablation studies are in Appendix A and Appendix B.
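The relative improvements quoted for WN18RR at $d=10$ can be recomputed from Table 3 as (MED − best baseline) / best baseline. The short sketch below does so for MRR; the helper name is hypothetical and the input values are read from Table 3.

```python
def relative_improvement(ours, best_baseline):
    """Relative gain over the best baseline, in percent."""
    return (ours - best_baseline) / best_baseline * 100.0

# (MED MRR, best baseline MRR) at 10 dimensions on WN18RR, taken from Table 3.
mrr_pairs = {
    "TransE": (0.170, 0.148),  # best baseline: DualDE
    "RotatE": (0.324, 0.299),  # best baseline: Ext
    "PairRE": (0.317, 0.245),  # best baseline: TA
}
gains = {
    name: round(relative_improvement(ours, base), 1)
    for name, (ours, base) in mrr_pairs.items()
}
```

The resulting values are 14.9, 8.4 and 29.4 percent, in agreement with the figures in the text.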

5.3 Parameter efficiency of MED

In Table 4, we compare our sub-models of suitable low dimensions to parameter-efficient KGEs proposed specifically for large-scale KGs, including NodePiece DBLP:conf/iclr/0001DWH22 and EARL DBLP:conf/aaai/ChenZYZGPC23 . When the number of model parameters is roughly equivalent, the performance of the sub-models of MED exceeds that of the specialized parameter-efficient KGE methods. This demonstrates that the sub-models of our method are parameter-efficient. More importantly, MED can provide parameter-efficient models of different sizes for applications.

Table 4: Link prediction results on WN18RR, FB15K237, CoDEx-L and YAGO3-10.

	FB15k-237					WN18RR					CoDEx-L					YAGO3-10
	Dim	#P(M)	MRR	Hit@10	Effi	Dim	#P(M)	MRR	Hit@10	Effi	Dim	#P(M)	MRR	Hit@10	Effi	Dim	#P(M)	MRR	Hit@10	Effi
RotatE	1000	29.3	0.336	0.532	0.011	500	40.6	0.508	0.612	0.013	500	78	0.258	0.387	0.003	500	123.2	0.495	0.670	0.004
RotatE	100	2.9	0.296	0.473	0.102	50	4.1	0.411	0.429	0.100	25	3.8	0.196	0.322	0.052	20	4.8	0.121	0.262	0.025
+ NodePiece	100	3.2	0.256	0.420	0.080	100	4.4	0.403	0.515	0.092	100	3.6	0.190	0.313	0.053	100	4.1	0.247	0.488	0.060
+ EARL	150	1.8	0.310	0.501	0.172	200	3.8	0.440	0.527	0.116	100	2.1	0.238	0.390	0.113	100	3	0.302	0.498	0.101
+ MED	40	1.2	0.318	0.504	0.265	40	3.2	0.466	0.561	0.146	20	3.1	0.243	0.385	0.078	20	4.9	0.313	0.528	0.064
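The Effi column in Table 4 is consistent with MRR divided by the number of parameters in millions. This reading is inferred from the table values, not from a definition given in this section, so the Python sketch below should be treated as an assumption; the function name `efficiency` is ours.

```python
# FB15k-237 rows of Table 4: (method, dimension, parameters in millions, MRR, reported Effi).
rows = [
    ("RotatE", 1000, 29.3, 0.336, 0.011),
    ("RotatE", 100, 2.9, 0.296, 0.102),
    ("+ NodePiece", 100, 3.2, 0.256, 0.080),
    ("+ EARL", 150, 1.8, 0.310, 0.172),
    ("+ MED", 40, 1.2, 0.318, 0.265),
]

def efficiency(mrr, params_m):
    """Assumed definition: MRR per million parameters."""
    return mrr / params_m

# Key by (method, dimension) because RotatE appears at two dimensions.
computed = {(name, dim): efficiency(mrr, p) for name, dim, p, mrr, _ in rows}
for name, dim, p, mrr, reported in rows:
    print(f"{name:12s} d={dim:<5d} computed={computed[(name, dim)]:.3f} reported={reported:.3f}")
```

If this reading is right, the 40-dimensional MED sub-model delivers the most MRR per parameter among the FB15k-237 rows.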

5.4 MED in real applications

We apply the trained croppable KGE with TransE on SKG to three real applications: the user labeling task on servers and the product recommendation task on PCs and mobile phones. Table 5 shows that our croppable user embeddings outperform all baselines, including directly trained (DT) embeddings, the best baseline DualDE, and principal component analysis (PCA), a dimension reduction method common in industry, applied to $M^{DT}_{max}$ . Notably, the strong performance on the mobile phone task (where storage and computing resources limit embeddings to at most 10 dimensions) demonstrates the practical value of our approach. More application details are in Appendix C.

Table 5: Results on SKG.

	User Labeling		Product Recommendation
	server (500d)		PC terminal (100d)		MP terminal (10d)
Method	acc.	f1	ndcg@5	ndcg@10	ndcg@5	ndcg@10
DT	0.889	0.874	0.411	0.441	0.344	0.361
PCA	-	-	0.417	0.447	0.392	0.418
DualDE	-	-	0.423	0.456	0.404	0.433
MED	0.893	0.879	0.431	0.465	0.422	0.451

5.5 Extend MED to Neural Networks

To verify the extensibility of our method to other neural networks, we take the language model BERT DBLP:conf/naacl/DevlinCLT19 as an example. To ensure the consistency of the experimental environment as much as possible, we uniformly adopt distillation methods implemented based on Hugging Face Transformers DBLP:conf/emnlp/WolfDSCDMCRLFDS20 as baselines. Following previous works DBLP:conf/emnlp/SunCGL19 ; DBLP:journals/corr/abs-1903-12136 ; DBLP:journals/eswa/JungKNK23 ; DBLP:conf/acl/ZhouXM22 , we do not use pre-training distillation settings and only distill at the fine-tuning stage. More experimental details are in Appendix D.

Table 6: Results on the dev set of GLUE. The results of knowledge distillation methods for BERT₄ and BERT₆ are those reported by DBLP:journals/eswa/JungKNK23 ; DBLP:conf/acl/ZhouXM22 , and results marked with † are reported by us.

Method	#P(M)	Speedup	MNLI-m (acc.)	MNLI-mm (acc.)	MRPC (f1/acc.)	QNLI (acc.)	QQP (f1/acc.)	RTE (acc.)	SST-2 (acc.)	STS-B (pear./spear.)
BERT$_{Base}$†	110	1.0 $\times$	84.4	85.3	88.6/84.1	89.7	89.6/91.1	67.5	92.5	88.8/88.5
BERT₆-BKD		2.0 $\times$	82.2	82.9	86.2/80.8	88.5	88.0/91.0	65.4	90.9	88.2/87.8
BERT₆-PKD		2.0 $\times$	82.3	82.6	86.4/81.0	88.6	87.9/91.0	63.9	90.8	88.5/88.1
BERT₆-MiniLM		2.0 $\times$	82.2	82.6	84.6/78.1	89.5	87.2/90.5	61.5	90.2	87.8/87.5
BERT₆-RKD		2.0 $\times$	82.4	82.9	86.9/81.8	88.9	88.1/91.2	65.2	91.0	88.4/88.1
BERT₆-FSD		2.0 $\times$	82.4	83.0	87.1/82.2	89.0	88.1/91.2	66.6	91.0	88.7/88.3
BERT₄-BKD		2.9 $\times$	80.5	80.9	87.2/83.1	87.5	86.6/90.4	65.2	90.2	84.5/84.2
BERT₄-PKD		2.9 $\times$	80.9	81.3	87.0/82.9	87.7	86.8/90.5	66.1	90.5	84.3/84.0
BERT₄-MetaDistil		2.9 $\times$	82.4	82.7	88.4/84.2	88.6	87.8/90.8	67.8	91.8	86.3/86.0
BERT-HAT†		2.0 $\times$	70.8	71.6	81.2/74.8	65.3	76.1/80.4	52.7	84.3	79.6/80.1
BERT-MED		2.0 $\times$	82.7	83.3	88.0/84.0	86.8	89.1/90.7	67.2	91.9	87.6/87.2
BERT-HAT†	17.5	4.7 $\times$	63.6	64.2	68.4/78.4	61.1	69.0/79.7	47.2	82.9	74.1/75.8
BERT-MED	17.5	4.7 $\times$	81.2	82.4	86.1/82.0	86.4	83.8/86.2	64.6	88.2	86.1/86.4
BERT-HAT†	6.36	5.2 $\times$	59.9	60.0	66.5/77.3	60.1	66.5/77.1	46.2	81.7	71.9/70.4
BERT-MED	6.36	5.2 $\times$	72.6	73.7	84.1/78.1	86.0	79.6/82.7	61.7	86.9	82.8/81.6

Table 6 shows the results on the development set of GLUE DBLP:conf/iclr/WangSMHLB19 . We compare MED with other KD models under a similar speedup or a comparable number of parameters. The results show that MED achieves competitive performance on most tasks compared with BERT-specialized KD methods. In addition, compared with HAT DBLP:conf/acl/WangWLCZGH20 , whose model architecture is the most similar to ours, the sub-models of MED outperform HAT at all three parameter sizes. Specifically, the sub-models with 54M, 17.5M, and 6.36M parameters achieve average improvements of $16.3\%$ , $21.7\%$ , and $19.7\%$ , respectively.

5.6 Analysis of MED

5.6.1 Training efficiency

Table 7: Training time (hours).

		TransE		RotatE		PairRE
WN	DT	74.0	(9.49 $\times$ )	141.0	(11.10 $\times$ )	67.4	(10.06 $\times$ )
	Ext-based	1.5	(0.19 $\times$ )	2.5	(0.20 $\times$ )	1.6	(0.24 $\times$ )
	BKD	91.5	(11.73 $\times$ )	163.0	(12.83 $\times$ )	87.5	(13.06 $\times$ )
	TA	172.0	(22.05 $\times$ )	272.0	(21.42 $\times$ )	166.0	(24.78 $\times$ )
	DualDE	151.0	(19.36 $\times$ )	240.0	(18.90 $\times$ )	133.0	(19.85 $\times$ )
	MED	7.8	(1.00 $\times$ )	12.7	(1.00 $\times$ )	6.7	(1.00 $\times$ )
FB	DT	218.0	(10.23 $\times$ )	381.0	(10.73 $\times$ )	179.0	(9.37 $\times$ )
	Ext-based	4.7	(0.22 $\times$ )	9.5	(0.27 $\times$ )	3.7	(0.19 $\times$ )
	BKD	248.0	(11.64 $\times$ )	443.0	(12.48 $\times$ )	231.0	(12.09 $\times$ )
	TA	-	-	-	-	-	-
	DualDE	-	-	-	-	-	-
	MED	21.3	(1.00 $\times$ )	35.5	(1.00 $\times$ )	19.1	(1.00 $\times$ )

We report the training time needed to obtain 64 models of all sizes ( $d$ =10, 20, …, 640) with each method in Table 7. For DT, the cost is the total time of directly training the 64 KGE models one after another. For the Ext-based baselines, the cost is identical across methods and equals the time of training one $d_{n}$ -dimensional KGE model, since arranging the dimensions takes negligible time. For the KD-based baselines, the cost is the time of training the $d_{n}$ -dimensional teacher model plus that of distilling the 63 student models ( $d$ =10, 20, …, 630) in turn. All training is performed on a single NVIDIA Tesla A100 40GB GPU for fair comparison. For TA and DualDE on FB15K237, we do not train student models of all 63 sizes, which is estimated to take more than 400 hours for each KGE method. Compared with directly training (DT) models of all sizes in turn, MED is roughly 10 $\times$ faster for all 3 KGE methods (between 9.37 $\times$ and 11.10 $\times$ in Table 7). Although the Ext-based baselines have the shortest training time, they perform particularly poorly and lack practical value. TA and DualDE need to optimize both the student model and a larger teacher model, which greatly increases the number of trainable parameters and the time cost.
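The parenthesized factors in Table 7 are the training time of each method divided by that of MED. A minimal Python sketch using the WN18RR rows is given below; the helper name `relative_cost` is ours.

```python
# Training hours on WN18RR from Table 7, ordered as (TransE, RotatE, PairRE).
hours = {
    "DT": (74.0, 141.0, 67.4),
    "Ext-based": (1.5, 2.5, 1.6),
    "BKD": (91.5, 163.0, 87.5),
    "TA": (172.0, 272.0, 166.0),
    "DualDE": (151.0, 240.0, 133.0),
    "MED": (7.8, 12.7, 6.7),
}

def relative_cost(method):
    """Training time of `method` divided by that of MED, for each KGE method."""
    return tuple(round(t / m, 2) for t, m in zip(hours[method], hours["MED"]))

for method in hours:
    print(method, relative_cost(method))
```

For example, `relative_cost("DT")` gives the 9.49, 11.10, and 10.06 factors listed for DT on WN18RR.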

5.6.2 Whether high-dimensional sub-models cover the capabilities of low-dimensional ones

If a high-dimensional model retains the ability of lower-dimensional models, it should correctly predict all triples that the lower-dimensional models can predict. We define the ability retention ratio (ARR) as the percentage of triples in the test set that satisfy the following condition: if the smallest sub-model that correctly predicts a given triple is $M_{i}$ , then all higher-dimensional sub-models ( $M_{i+1}$ , $M_{i+2}$ , …, $M_{n}$ ) also predict it correctly. We use Hit@10 to judge whether a triple is correctly predicted; that is, $M_{i}$ correctly predicts a triple if it ranks this triple in the top 10 among all candidate triples.
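The ARR definition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not our evaluation code: the function name, the boolean matrix `hits`, and the flag `count_unpredicted` are hypothetical, and the handling of triples that no sub-model predicts is not specified by the definition, so it is exposed as an option.

```python
def ability_retention_ratio(hits, count_unpredicted=False):
    """Compute ARR from per-sub-model Hit@10 outcomes.

    hits[i][j] is True when the i-th sub-model (ordered from the lowest to the
    highest dimension) ranks test triple j in the top 10.
    count_unpredicted decides whether triples that no sub-model predicts count
    as retained (an assumption; the definition leaves this case open).
    """
    n_models, n_triples = len(hits), len(hits[0])
    retained = 0
    for j in range(n_triples):
        column = [hits[i][j] for i in range(n_models)]
        if not any(column):
            retained += int(count_unpredicted)
            continue
        first = column.index(True)  # smallest sub-model that predicts triple j
        if all(column[first:]):     # every larger sub-model also predicts it
            retained += 1
    return retained / n_triples

# Toy example with 3 sub-models and 4 test triples (made-up outcomes).
toy_hits = [
    [True, False, False, True],   # lowest-dimensional sub-model
    [True, True, False, False],   # middle sub-model
    [True, True, False, True],    # highest-dimensional sub-model
]
print(ability_retention_ratio(toy_hits))
```

In the toy data, the first two triples are retained, the third is predicted by no sub-model, and the fourth is lost by the middle sub-model, so the result depends on how unpredicted triples are counted.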

From Fig. 4, we find that the ARR of MED is consistently much higher than that of the baselines, especially on FB15K237. This indicates that high-dimensional sub-models in MED successfully cover the capabilities of low-dimensional ones, thanks to the mutual learning mechanism, which helps high-dimensional sub-models review what low-dimensional sub-models have learned. This property of MED also gives a simple way to judge how easy or difficult a triple is for KGE methods to learn: a triple that low-dimensional sub-models can correctly predict is likely easy, since the higher-dimensional sub-models can also predict it, while a triple that only a particularly high-dimensional sub-model can predict is difficult.

5.6.3 Visual analysis of embedding

We select four primary entity categories (‘organization’, ‘sports’, ‘location’, and ‘music’) that each contain more than 300 entities in FB15K237, and randomly select 250 entities from each. We project these entities’ embeddings at 3 different dimensions ( $d$ =10, 100, 600) with the t-SNE algorithm and visualize the resulting clusters in Fig. 5. Under the same dimension, the clustering result of MED is always the best, followed by DualDE, while the result of Ext-V is generally poor, which is consistent with the conclusion in Section 5.2. We also observe two notable phenomena for MED as the dimension increases: 1) the ‘sports’ nodes gradually separate into two clusters, meaning that MED learns more fine-grained category information as the dimension increases, and 2) the relative distribution among different categories hardly changes, showing a trend of “inheritance” and “improvement”. This further supports our expectation that high-dimensional sub-models in MED retain the ability of low-dimensional sub-models and can learn more knowledge than they do.

6 Conclusion

In this work, we propose a novel KGE training framework, MED, that trains a croppable KGE once, after which sub-models of various required dimensions can be cropped out of it and used directly without additional training. In MED, we propose the mutual learning mechanism to improve the performance of low-dimensional sub-models and make the high-dimensional sub-models retain the ability of the low-dimensional ones, the evolutionary improvement mechanism to motivate high-dimensional sub-models to master knowledge that low-dimensional ones cannot learn, and the dynamic loss weight to adaptively balance multiple losses. The experimental results show the effectiveness and high efficiency of our method: all sub-models achieve promising performance, and the performance of low-dimensional sub-models in particular is greatly improved. In future work, we will further explore the more fine-grained information encoding ability of each sub-model.

References

[1] Yushan Zhu, Huaixiao Zhao, Wen Zhang, Ganqiang Ye, Hui Chen, Ningyu Zhang, and Huajun Chen. Knowledge perceived multi-modal pretraining in e-commerce. In ACM Multimedia, pages 2744–2752. ACM, 2021.
[2] Wen Zhang, Chi Man Wong, Ganqiang Ye, Bo Wen, Wei Zhang, and Huajun Chen. Billion-scale pre-trained e-commerce product knowledge graph model. In ICDE, pages 2476–2487. IEEE, 2021.
[3] Raphael Hoffmann, Congle Zhang, Xiao Ling, Luke S. Zettlemoyer, and Daniel S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In ACL, pages 541–550. The Association for Computer Linguistics, 2011.
[4] Joachim Daiber, Max Jakob, Chris Hokamp, and Pablo N. Mendes. Improving efficiency and accuracy in multilingual entity extraction. In I-SEMANTICS, pages 121–124. ACM, 2013.
[5] Yuanzhe Zhang, Kang Liu, Shizhu He, Guoliang Ji, Zhanyi Liu, Hua Wu, and Jun Zhao. Question answering over knowledge base with neural attention combining global knowledge information. CoRR, abs/1606.00979, 2016.
[6] Dennis Diefenbach, Kamal Deep Singh, and Pierre Maret. Wdaqua-core1: A question answering service for RDF knowledge bases. In WWW (Companion Volume), pages 1087–1091. ACM, 2018.
[7] Antoine Bordes, Nicolas Usunier, Alberto García-Durán, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NIPS, pages 2787–2795, 2013.
[8] Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. Rotate: Knowledge graph embedding by relational rotation in complex space. In ICLR (Poster). OpenReview.net, 2019.
[9] Yushan Zhu, Wen Zhang, Mingyang Chen, Hui Chen, Xu Cheng, Wei Zhang, and Huajun Chen. Dualde: Dually distilling knowledge graph embedding for faster and cheaper reasoning. In WSDM, pages 1516–1524. ACM, 2022.
[10] Mrinmaya Sachan. Knowledge graph embedding compression. In ACL, pages 2681–2691. Association for Computational Linguistics, 2020.
[11] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. CoRR, abs/1503.02531, 2015.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), pages 4171–4186. Association for Computational Linguistics, 2019.
[13] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019.
[14] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, pages 1112–1119. AAAI Press, 2014.
[15] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187. AAAI Press, 2015.
[16] Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. Knowledge graph embedding via dynamic mapping matrix. In ACL (1), pages 687–696. The Association for Computer Linguistics, 2015.
[17] Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, pages 809–816. Omnipress, 2011.
[18] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In ICLR (Poster), 2015.
[19] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, volume 48 of JMLR Workshop and Conference Proceedings, pages 2071–2080. JMLR.org, 2016.
[20] Seyed Mehran Kazemi and David Poole. Simple embedding for link prediction in knowledge graphs. In NeurIPS, pages 4289–4300, 2018.
[21] Shuai Zhang, Yi Tay, Lina Yao, and Qi Liu. Quaternion knowledge graph embeddings. In NeurIPS, pages 2731–2741, 2019.
[22] Canran Xu and Ruijiang Li. Relation embedding with dihedral group in knowledge graph. In ACL (1), pages 263–272. Association for Computational Linguistics, 2019.
[23] Linlin Chao, Jianshan He, Taifeng Wang, and Wei Chu. Pairre: Knowledge graph embeddings via paired relation vectors. In ACL/IJCNLP (1), pages 4360–4369. Association for Computational Linguistics, 2021.
[24] Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In AAAI, pages 1811–1818. AAAI Press, 2018.
[25] Dai Quoc Nguyen, Tu Dinh Nguyen, Dat Quoc Nguyen, and Dinh Q. Phung. A novel embedding model for knowledge base completion based on convolutional neural network. In NAACL-HLT (2), pages 327–333. Association for Computational Linguistics, 2018.
[26] Michael Sejr Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. In ESWC, volume 10843 of Lecture Notes in Computer Science, pages 593–607. Springer, 2018.
[27] Shikhar Vashishth, Soumya Sanyal, Vikram Nitin, and Partha P. Talukdar. Composition-based multi-relational graph convolutional networks. In ICLR. OpenReview.net, 2020.
[28] Seyed-Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In AAAI, pages 5191–5198. AAAI Press, 2020.
[29] Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jin Jin, Xin Jiang, Qun Liu, Michael R. Lyu, and Irwin King. Binarybert: Pushing the limit of BERT quantization. In ACL/IJCNLP (1), pages 4334–4348. Association for Computational Linguistics, 2021.
[30] Pierre Stock, Angela Fan, Benjamin Graham, Edouard Grave, Rémi Gribonval, Hervé Jégou, and Armand Joulin. Training with quantization noise for extreme model compression. In ICLR. OpenReview.net, 2021.
[31] Haoyu Wang, Yaqing Wang, Defu Lian, and Jing Gao. A lightweight knowledge graph embedding framework for efficient inference and storage. In CIKM, pages 1909–1918. ACM, 2021.
[32] Siqi Sun, Yu Cheng, Zhe Gan, and Jingjing Liu. Patient knowledge distillation for BERT model compression. In EMNLP/IJCNLP (1), pages 4322–4331. Association for Computational Linguistics, 2019.
[33] Kai Wang, Yu Liu, Qian Ma, and Quan Z. Sheng. Mulde: Multi-teacher knowledge distillation for low-dimensional knowledge graph embeddings. In WWW, pages 1716–1726. ACM / IW3C2, 2021.
[34] Zhehui Zhou, Defang Chen, Can Wang, Yan Feng, and Chun Chen. Improving knowledge graph embedding via iterative self-semantic knowledge distillation. CoRR, abs/2206.02963, 2022.
[35] Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, and Jimmy Lin. Distilling task-specific knowledge from BERT into simple neural networks. CoRR, abs/1903.12136, 2019.
[36] Peter J. Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35(1):73–101, 1964.
[37] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. Representing text for joint embedding of text and knowledge bases. In EMNLP, pages 1499–1509. The Association for Computational Linguistics, 2015.
[38] Tara Safavi and Danai Koutra. Codex: A comprehensive knowledge graph completion benchmark. In EMNLP (1), pages 8328–8350. Association for Computational Linguistics, 2020.
[39] Farzaneh Mahdisoltani, Joanna Biega, and Fabian M. Suchanek. YAGO3: A knowledge base from multilingual wikipedias. In CIDR. www.cidrdb.org, 2015.
[40] Mingyang Chen, Wen Zhang, Zhen Yao, Yushan Zhu, Yang Gao, Jeff Z. Pan, and Huajun Chen. Entity-agnostic representation learning for parameter-efficient knowledge graph embedding. In AAAI, pages 4182–4190. AAAI Press, 2023.
[41] Xu Han, Shulin Cao, Xin Lv, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. Openke: An open toolkit for knowledge embedding. In EMNLP (Demonstration), pages 139–144. Association for Computational Linguistics, 2018.
[42] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR (Poster), 2015.
[43] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR (Poster). OpenReview.net, 2017.
[44] Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In ACL (1), pages 5797–5808. Association for Computational Linguistics, 2019.
[45] Mikhail Galkin, Etienne G. Denis, Jiapeng Wu, and William L. Hamilton. Nodepiece: Compositional and parameter-efficient representations of large knowledge graphs. In ICLR. OpenReview.net, 2022.
[46] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Qun Liu and David Schlangen, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, EMNLP 2020 - Demos, Online, November 16-20, 2020, pages 38–45. Association for Computational Linguistics, 2020.
[47] Hee-Jun Jung, Doyeon Kim, Seung-Hoon Na, and Kangil Kim. Feature structure distillation with centered kernel alignment in BERT transferring. Expert Syst. Appl., 234:120980, 2023.
[48] Wangchunshu Zhou, Canwen Xu, and Julian J. McAuley. BERT learns to teach: Knowledge distillation with meta learning. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 7037–7049. Association for Computational Linguistics, 2022.
[49] Hanrui Wang, Zhanghao Wu, Zhijian Liu, Han Cai, Ligeng Zhu, Chuang Gan, and Song Han. HAT: hardware-aware transformers for efficient natural language processing. In ACL, pages 7675–7688. Association for Computational Linguistics, 2020.
[50] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. Neural collaborative filtering. In WWW, pages 173–182. ACM, 2017.
[51] William B. Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing, IWP@IJCNLP 2005, Jeju Island, Korea, October 2005, 2005. Asian Federation of Natural Language Processing, 2005.
[52] Alexis Conneau and Douwe Kiela. Senteval: An evaluation toolkit for universal sentence representations. In Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Kôiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis, and Takenobu Tokunaga, editors, Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA), 2018.
[53] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1631–1642. ACL, 2013.
[54] Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Marilyn A. Walker, Heng Ji, and Amanda Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 1112–1122. Association for Computational Linguistics, 2018.
[55] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392. The Association for Computational Linguistics, 2016.
[56] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In CVPR, pages 3967–3976. Computer Vision Foundation / IEEE, 2019.
[57] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.

Appendix A More Results of link prediction

More link prediction results are shown in Table 8 and Table 9 for WN18RR, and in Table 10 and Table 11 for FB15K237. Comparisons of all MED sub-models with the directly trained KGEs (DT) at dimensions from 10 to 640 are shown in Fig. 6.

Table 8: MRR and Hit@1 of some representative dimensions on WN18RR.

		10d		20d		40d		80d		160d		320d		640d
	Method	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1
TransE	DT	.121	.011	.176	.016	.214	.018	.227	.025	.233	.027	.235	.033	.237	.034
	Ext	.125	.016	.172	.023	.199	.023	.213	.028	.225	.033	.226	.028	.237	.034
	Ext-L	.139	.029	.196	.025	.224	.039	.232	.046	.236	.036	.236	.033	.237	.034
	Ext-V	.139	.029	.198	.045	.222	.051	.234	.047	.236	.036	.236	.027	.237	.034
	BKD	.141	.035	.207	.040	.226	.033	.232	.031	.233	.030	.236	.032	-	-
	TA	.144	.040	.211	.043	.226	.037	.233	.030	.234	.030	.236	.034	-	-
	DualDE	.148	.037	.213	.043	.225	.037	.234	.031	.235	.031	.238	.034	-	-
	MED	.170	.040	.219	.045	.232	.048	.232	.042	.236	.037	.237	.033	.237	.031
RotatE	DT	.172	.005	.409	.357	.456	.393	.465	.420	.471	.423	.474	.428	.476	.429
	Ext	.299	.257	.379	.335	.437	.395	.458	.415	.467	.413	.471	.418	.476	.429
	Ext-L	.206	.166	.336	.288	.399	.352	.423	.373	.445	.396	.466	.417	.476	.429
	Ext-V	.261	.197	.304	.234	.337	.263	.366	.293	.416	.357	.451	.397	.476	.429
	BKD	.175	.009	.424	.361	.457	.403	.471	.421	.472	.424	.474	.425	-	-
	TA	.177	.010	.424	.363	.459	.408	.470	.420	.473	.422	.474	.425	-	-
	DualDE	.179	.011	.425	.364	.462	.412	.471	.423	.473	.426	.475	.425	-	-
	MED	.324	.277	.456	.409	.466	.418	.471	.422	.471	.424	.476	.427	.476	.428
PairRE	DT	.220	.174	.342	.313	.415	.384	.435	.399	.449	.405	.452	.406	.453	.407
	Ext	.152	.120	.261	.198	.334	.267	.375	.314	.419	.364	.438	.388	.453	.407
	Ext-L	.162	.129	.281	.237	.363	.319	.417	.377	.437	.395	.446	.400	.453	.407
	Ext-V	.172	.124	.306	.269	.389	.352	.420	.379	.441	.398	.446	.400	.453	.407
	BKD	.228	.184	.375	.334	.421	.372	.443	.405	.451	.405	.453	.407	-	-
	TA	.245	.197	.381	.332	.426	.380	.448	.404	.452	.409	.453	.408	-	-
	DualDE	.242	.175	.377	.330	.428	.381	.451	.409	.453	.410	.454	.410	-	-
	MED	.317	.259	.408	.367	.433	.392	.449	.405	.451	.406	.451	.407	.451	.406

Table 9: Hit@10 and Hit@3 of some representative dimensions on WN18RR.

		10d		20d		40d		80d		160d		320d		640d
	Method	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3
TransE	DT	.287	.202	.453	.291	.496	.385	.524	.401	.531	.403	.534	.407	.537	.412
	Ext	.298	.201	.423	.285	.468	.338	.495	.364	.515	.384	.521	.388	.537	.412
	Ext-L	.315	.218	.461	.317	.497	.361	.516	.403	.534	.405	.535	.408	.537	.412
	Ext-V	.309	.218	.458	.314	.494	.391	.525	.407	.532	.408	.536	.411	.537	.412
	BKD	.323	.216	.480	.331	.513	.392	.527	.401	.531	.404	.533	.407	-	-
	TA	.335	.224	.483	.343	.512	.395	.527	.408	.533	.407	.535	.410	-	-
	DualDE	.337	.226	.488	.346	.514	.394	.530	.408	.533	.408	.535	.411	-	-
	MED	.388	.269	.491	.369	.518	.399	.523	.404	.529	.407	.536	.410	.537	.412
RotatE	DT	.418	.304	.504	.436	.556	.475	.564	.487	.567	.489	.573	.491	.575	.493
	Ext	.378	.315	.464	.399	.516	.452	.544	.472	.549	.480	.552	.470	.575	.493
	Ext-L	.277	.224	.424	.359	.487	.420	.515	.441	.541	.461	.564	.481	.575	.493
	Ext-V	.377	.289	.433	.336	.471	.377	.497	.402	.532	.442	.561	.467	.575	.493
	BKD	.434	.312	.540	.452	.556	.479	.565	.487	.570	.490	.572	.492	-	-
	TA	.438	.314	.542	.452	.558	.481	.567	.489	.572	.488	.572	.492	-	-
	DualDE	.440	.320	.542	.452	.559	.483	.567	.489	.573	.488	.573	.491	-	-
	MED	.469	.354	.543	.476	.561	.486	.568	.490	.574	.492	.573	.493	.574	.495
PairRE	DT	.321	.271	.381	.368	.472	.428	.516	.450	.534	.463	.542	.462	.544	.464
	Ext	.209	.163	.379	.292	.463	.366	.493	.398	.526	.437	.545	.452	.544	.464
	Ext-L	.220	.175	.360	.302	.442	.383	.495	.431	.523	.450	.544	.455	.544	.464
	Ext-V	.260	.192	.374	.323	.456	.407	.498	.435	.529	.452	.541	.458	.544	.464
	BKD	.336	.279	.413	.388	.483	.435	.525	.452	.536	.460	.542	.463	-	-
	TA	.340	.293	.427	.387	.487	.437	.534	.460	.537	.462	.543	.463	-	-
	DualDE	.336	.281	.424	.389	.495	.437	.536	.463	.540	.463	.544	.465	-	-
	MED	.376	.314	.467	.426	.502	.443	.537	.462	.541	.464	.542	.465	.542	.464

Table 10: MRR and Hit@1 of some representative dimensions on FB15K237.

		10d		20d		40d		80d		160d		320d		640d
	Method	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1	MRR	Hit@1
TransE	DT	.150	.102	.277	.190	.299	.212	.313	.218	.315	.222	.318	.224	.322	.228
	Ext	.115	.065	.191	.122	.236	.156	.266	.180	.286	.197	.299	.208	.322	.228
	Ext-L	.109	.065	.175	.115	.232	.157	.263	.180	.285	.198	.301	.210	.322	.228
	Ext-V	.139	.081	.200	.126	.237	.156	.270	.185	.293	.205	.308	.217	.322	.228
	BKD	.176	.106	.279	.198	.303	.208	.315	.222	.315	.223	.320	.226	-	-
	TA	.175	.112	.281	.200	.303	.212	.314	.220	.319	.225	.321	.223	-	-
	DualDE	.179	.115	.281	.201	.306	.216	.316	.223	.319	.226	.322	.227	-	-
	MED	.196	.122	.290	.199	.308	.218	.317	.223	.320	.226	.321	.227	.322	.227
RotatE	DT	.254	.168	.297	.207	.312	.223	.317	.224	.322	.229	.323	.230	.325	.234
	Ext	.138	.080	.203	.129	.251	.170	.276	.190	.291	.203	.305	.217	.325	.234
	Ext-L	.135	.078	.188	.121	.221	.146	.246	.166	.280	.193	.299	.209	.325	.234
	Ext-V	.160	.097	.198	.126	.238	.159	.265	.182	.288	.201	.302	.213	.325	.234
	BKD	.277	.193	.305	.214	.314	.224	.321	.230	.322	.230	.323	.231	-	-
	TA	.280	.196	.306	.216	.313	.225	.319	.229	.323	.229	.323	.231	-	-
	DualDE	.282	.197	.307	.216	.315	.227	.318	.230	.322	.232	.324	.233	-	-
	MED	.288	.201	.311	.216	.318	.225	.322	.231	.323	.233	.324	.233	.324	.232
PairRE	DT	.182	.116	.243	.162	.284	.202	.307	.222	.319	.227	.328	.235	.332	.237
	Ext	.148	.107	.177	.118	.217	.149	.259	.182	.294	.207	.321	.230	.332	.237
	Ext-L	.150	.099	.196	.134	.219	.159	.271	.188	.309	.219	.326	.233	.332	.237
	Ext-V	.176	.116	.192	.125	.229	.154	.279	.193	.311	.221	.329	.237	.332	.237
	BKD	.198	.132	.251	.168	.288	.203	.311	.224	.321	.233	.330	.236	-	-
	TA	.208	.139	.263	.182	.292	.210	.314	.224	.323	.232	.332	.235	-	-
	DualDE	.207	.139	.261	.179	.293	.212	.316	.226	.326	.234	.334	.238	-	-
	MED	.239	.172	.274	.189	.303	.213	.314	.224	.324	.232	.329	.236	.330	.235

Table 11: Hit@10 and Hit@3 of some representative dimensions on FB15K237.

		10d		20d		40d		80d		160d		320d		640d
	Method	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3	Hit@10	Hit@3
TransE	DT	.235	.169	.440	.301	.477	.327	.484	.340	.499	.348	.501	.353	.508	.358
	Ext	.211	.123	.324	.211	.392	.264	.436	.296	.462	.320	.479	.331	.508	.358
	Ext-L	.194	.118	.293	.192	.381	.256	.424	.292	.462	.316	.484	.333	.508	.358
	Ext-V	.256	.150	.348	.222	.396	.265	.437	.301	.466	.325	.488	.341	.508	.358
	BKD	.293	.178	.446	.308	.480	.336	.500	.349	.501	.349	.502	.354	-	-
	TA	.246	.188	.441	.307	.484	.336	.498	.348	.504	.353	.504	.355	-	-
	DualDE	.301	.193	.443	.307	.483	.337	.502	.351	.505	.354	.508	.356	-	-
	MED	.341	.215	.472	.321	.486	.338	.502	.347	.505	.351	.507	.356	.507	.358
RotatE	DT	.424	.284	.477	.330	.495	.346	.502	.352	.506	.353	.510	.357	.515	.363
	Ext	.245	.152	.340	.225	.410	.278	.443	.304	.465	.322	.485	.335	.515	.363
	Ext-L	.243	.147	.319	.209	.365	.247	.402	.275	.453	.312	.477	.333	.515	.363
	Ext-V	.281	.174	.340	.218	.393	.264	.427	.293	.458	.319	.478	.336	.515	.363
	BKD	.442	.306	.485	.338	.503	.352	.508	.354	.510	.356	.509	.358	-	-
	TA	.447	.308	.485	.339	.501	.353	.507	.358	.510	.359	.509	.358	-	-
	DualDE	.449	.311	.486	.341	.502	.353	.507	.360	.512	.361	.514	.361	-	-
	MED	.459	.324	.492	.344	.504	.355	.509	.357	.510	.358	.512	.362	.514	.362
PairRE	DT	.314	.198	.395	.262	.452	.312	.476	.337	.505	.352	.518	.364	.522	.368
	Ext	.222	.158	.289	.187	.353	.236	.416	.283	.469	.325	.506	.354	.522	.368
	Ext-L	.249	.159	.294	.196	.333	.238	.436	.298	.489	.342	.513	.359	.522	.368
	Ext-V	.277	.181	.303	.192	.374	.250	.450	.307	.490	.343	.513	.362	.522	.368
	BKD	.332	.215	.407	.265	.453	.314	.487	.343	.508	.355	.521	.366	-	-
	TA	.346	.226	.430	.291	.455	.316	.493	.347	.509	.358	.521	.368	-	-
	DualDE	.342	.224	.427	.286	.456	.318	.495	.351	.512	.359	.524	.371	-	-
	MED	.384	.253	.437	.299	.466	.327	.495	.346	.510	.357	.521	.366	.520	.368

Appendix B Ablation Study

We conduct ablation studies to evaluate the effect of the three modules in MED: the mutual learning mechanism (MLM), the evolutionary improvement mechanism (EIM), and the dynamic loss weight (DLW). Table 12 shows the MRR and Hit@ $k$ ( $k=1,3,10$ ) of MED with each of these modules removed in turn, on WN18RR with TransE.

Table 12: Ablation study on WN18RR with TransE.

dim	MED				MED w/o MLM				MED w/o EIM				MED w/o DLW
dim	MRR	Hit@10	Hit@3	Hit@1	MRR	Hit@10	Hit@3	Hit@1	MRR	Hit@10	Hit@3	Hit@1	MRR	Hit@10	Hit@3	Hit@1
10	.170	.388	.269	.036	.149	.335	.234	.032	.169	.388	.267	.037	.171	.387	.268	.035
20	.219	.491	.369	.042	.197	.437	.323	.032	.217	.488	.366	.044	.218	.487	.367	.039
40	.232	.518	.399	.048	.224	.496	.379	.029	.232	.517	.403	.042	.232	.517	.402	.037
80	.232	.523	.404	.042	.228	.521	.399	.033	.235	.529	.408	.037	.234	.523	.410	.041
160	.236	.529	.407	.037	.234	.525	.406	.034	.234	.527	.405	.032	.235	.527	.405	.032
320	.237	.536	.410	.033	.236	.532	.409	.035	.233	.530	.398	.031	.234	.533	.405	.029
640	.237	.537	.412	.031	.238	.535	.412	.042	.232	.528	.402	.029	.233	.530	.396	.025

B.1 Mutual Learning Mechanism (MLM)

We remove the mutual learning mechanism from MED and keep the other parts unchanged, so that Equation (5) is rewritten as

L=\sum_{i=1}^{n}\exp\left(\frac{w_{3}\cdot d_{i}}{d_{n}}\right)\cdot L_{EI}^{i}.

(6)

From the results of “MED w/o MLM” in Table 12, we find that removing the mutual learning mechanism seriously degrades the low-dimensional sub-models, since they can no longer learn from the high-dimensional sub-models. For example, the MRR of the 10-dimensional sub-model decreases by $12.4\%$ (from .170 to .149), and the MRR of the 20-dimensional sub-model decreases by $10\%$ (from .219 to .197). In contrast, the degradation of the high-dimensional sub-models is not particularly obvious, and the MRR of the highest-dimensional sub-model ($dim=640$) is no worse than that of MED (.238 vs. .237). This is because, to a certain degree, removing the mutual learning mechanism also removes the negative influence of the low-dimensional sub-models on the high-dimensional ones. On the whole, this mechanism greatly improves the performance of low-dimensional sub-models.
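The weighting in Equation (6) can be illustrated with a minimal sketch. It assumes that $w_{3}$ is a plain scalar (the value 1.0 below is hypothetical), that the dimensions are listed in increasing order so the last one is $d_{n}$, and that the per-sub-model losses $L_{EI}^{i}$ are already available as numbers; the function name and placeholder loss values are illustrative only.

```python
import math

def loss_without_mlm(ei_losses, dims, w3):
    """Eq. (6): sum_i exp(w3 * d_i / d_n) * L_EI^i.

    ei_losses: per-sub-model evolutionary improvement losses (assumed given).
    dims:      sub-model dimensions in increasing order; dims[-1] is d_n.
    w3:        scalar weight parameter (assumed to be a plain float here).
    """
    if len(ei_losses) != len(dims):
        raise ValueError("need one loss per sub-model dimension")
    d_n = dims[-1]
    weights = [math.exp(w3 * d / d_n) for d in dims]
    total = sum(w * l for w, l in zip(weights, ei_losses))
    return total, weights

# Dimensions from Table 12; the loss values are placeholders, not results.
dims = [10, 20, 40, 80, 160, 320, 640]
ei_losses = [1.0] * len(dims)
total, weights = loss_without_mlm(ei_losses, dims, w3=1.0)
```

With a positive $w_{3}$, the weight grows with $d_{i}$, so higher-dimensional sub-models contribute more to the total; with a negative $w_{3}$ the ordering would be reversed.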

B.2 Evolutionary Improvement Mechanism (EIM)

In this part, we replace evolutionary improvement loss $L_{EI}^{i}$ in (5) with the regular KGE loss $L_{KGE}^{i}$ :

L_{KGE}^{i}=\sum_{(h,r,t)\in\mathcal{T}\cup\mathcal{T}^{-}}y\log\sigma(s_{(h,r,t)}^{i})+(1-y)\log(1-\sigma(s_{(h,r,t)}^{i})).

(7)

From the results of “MED w/o EIM” in Table 12, we find that removing the evolutionary improvement mechanism mainly degrades the performance of high-dimensional sub-models. Because the mutual learning mechanism is still present, the low-dimensional sub-models can still learn from the high-dimensional ones, which preserves their performance. We also find that once the dimension increases beyond a certain point, the performance of the sub-models stops improving and even begins to decline (MRR peaks at .235 for $dim=80$ and falls to .232 for $dim=640$). We conjecture that this is because the mutual learning mechanism makes every pair of neighboring sub-models learn from each other, so some low-quality or wrong knowledge gradually transfers from the low-dimensional sub-models to the high-dimensional ones; when the evolutionary improvement mechanism is removed, the high-dimensional sub-models can no longer correct this wrong information. The higher the dimension of a sub-model, the more error accumulates, so the performance of the high-dimensional sub-models is seriously damaged. On the whole, this mechanism mainly helps to improve the high-dimensional sub-models.
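The sum in Equation (7) can be sketched as follows. This is a minimal illustration that evaluates the expression exactly as printed, with $y=1$ for triples in $\mathcal{T}$ and $y=0$ for negative triples in $\mathcal{T}^{-}$; the function names and the example scores are hypothetical. As printed, the expression is a log-likelihood (never positive), so a loss to be minimized would typically be its negation; that sign convention is an assumption here, not something stated in this section.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def kge_objective(scores, labels):
    """Evaluate the sum in Eq. (7) as printed.

    scores: triple scores s_(h,r,t) of one sub-model.
    labels: 1 for positive triples, 0 for negative triples.
    """
    total = 0.0
    for s, y in zip(scores, labels):
        log_p = -softplus(-s)      # log sigma(s)
        log_not_p = -softplus(s)   # log(1 - sigma(s))
        total += y * log_p + (1 - y) * log_not_p
    return total

# Hypothetical scores: one positive triple and one negative triple.
value = kge_objective([2.0, -1.5], [1, 0])
```

Using `softplus` instead of computing $\sigma$ and then its logarithm avoids taking the logarithm of zero when a score is very large in magnitude.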

B.3 Dynamic Loss Weight (DLW)

To study the effect of the dynamic loss weight, we fix the ratio of all mutual learning losses to all evolutionary improvement losses at $1:1$, so that Equation (5) is rewritten as

L=\sum_{i=2}^{n}L_{ML}^{i-1,i}+\sum_{i=1}^{n}L_{EI}^{i}.

(8)

According to Table 12, the overall results of “MED w/o DLW” lie between those of “MED w/o MLM” and “MED w/o EIM”: the low-dimensional sub-models perform better than in “MED w/o MLM”, and the high-dimensional sub-models perform better than in “MED w/o EIM”. On the whole, the results are closer to “MED w/o EIM”, that is, the performance of the low-dimensional sub-models does not change much, while the performance of the high-dimensional sub-models decreases more noticeably. We believe that for the high-dimensional sub-models, the proportion of the mutual learning loss is still too large, which makes them more negatively affected by the low-dimensional sub-models. This result indicates that the dynamic loss weight adaptively balances the multiple losses and contributes to improving overall performance.

Appendix C Details of applying the KGE trained by MED to real applications

The SKG is used in many user-related tasks, and injecting user embeddings trained over SKG into downstream task models is a common and practical approach.

User labeling is one of the common user management tasks that e-commerce platforms run on backend servers. We model user labeling as a multiclass classification task for user embeddings with a 2-layer MLP:

\mathcal{L}=-\frac{1}{|\mathcal{U}|}\sum_{i=1}^{|\mathcal{U}|}\sum_{j=1}^{|\mathcal{CLS}|}y_{ij}\log(\mathrm{MLP}(u_{i})),

(9)

where $u_{i}$ is the $i$ -th user’s embedding, the label $y_{ij}=1$ if user $u_{i}$ belongs to class $cls_{j}$ , otherwise $y_{ij}=0$ .
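Equation (9) can be sketched as a small, dependency-free example. It assumes that $\mathrm{MLP}(u_{i})$ denotes the predicted probability of class $cls_{j}$, obtained by a softmax over the outputs of a 2-layer MLP with a ReLU hidden layer; the activation, the softmax, and all parameter values below are assumptions made for illustration and are not specified in the text.

```python
import math

def log_softmax(logits):
    # Stable log-softmax: subtract the max before exponentiating.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def mlp_log_probs(u, W1, b1, W2, b2):
    """2-layer MLP: ReLU hidden layer, then log-softmax over classes."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, u)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return log_softmax(logits)

def user_labeling_loss(users, labels, params):
    """Eq. (9): mean over users of -sum_j y_ij * log p_ij."""
    total = 0.0
    for u, y in zip(users, labels):
        log_probs = mlp_log_probs(u, *params)
        total -= sum(y_j * lp for y_j, lp in zip(y, log_probs))
    return total / len(users)

# Toy data: two 3-dimensional user embeddings, two classes (one-hot labels).
users = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.5]]
labels = [[1, 0], [0, 1]]
W1 = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]   # hidden size 2
b1 = [0.0, 0.1]
W2 = [[0.5, -0.5], [-0.3, 0.2]]             # 2 output classes
b2 = [0.0, 0.0]
loss = user_labeling_loss(users, labels, (W1, b1, W2, b2))
```

When a cropped sub-model is used, only the length of each `u` changes, so the first-layer weight matrix `W1` must have a matching number of columns.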

The product recommendation task is to recommend to each user the items that the user is likely to interact with, and it often runs on terminal devices. Following PKGM DBLP:conf/icde/ZhangWYWZC21 , which recommends items to users using the neural collaborative filtering (NCF) DBLP:conf/www/HeLZNHC17 framework with pre-trained user embeddings as service vectors, we add user embeddings trained over SKG to NCF as service vectors. In NCF, the MLP layer learns item-user interactions from the latent features of the user and the item; that is, for a given user-item pair $user_{i}-item_{j}$ , the interaction function is

\phi_{1}^{MLP}\left(p_{i},q_{j}\right)=\mathrm{MLP}([p_{i};q_{j}]),

(10)

where $p_{i}$ and $q_{j}$ are latent feature vectors of user and item learned in NCF. We add the trained user embedding $u_{i}$ to NCF’s MLP layer and rewrite Equation (10) as

\phi_{1}^{MLP}\left(p_{i},q_{j},u_{i}\right)=\mathrm{MLP}([p_{i};q_{j};u_{i}]),

(11)

and the other parts of NCF stay the same as in PKGM DBLP:conf/icde/ZhangWYWZC21 .
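The change from Equation (10) to Equation (11) only affects the input of the MLP, which can be sketched as below. The helper names and vector values are hypothetical, and the cropped user embedding is taken as the front portion of the full embedding, in line with how sub-models are cropped in MED.

```python
def concat(*vectors):
    """[a; b; ...]: concatenate vectors end to end."""
    out = []
    for v in vectors:
        out.extend(v)
    return out

def ncf_mlp_input(p_i, q_j, u_i=None):
    """MLP input of Eq. (10) when u_i is None, of Eq. (11) otherwise."""
    if u_i is None:
        return concat(p_i, q_j)
    return concat(p_i, q_j, u_i)

# Hypothetical latent features learned in NCF.
p_i = [0.1, 0.2]
q_j = [0.3, 0.4]

# Hypothetical trained user embedding and a 2-dimensional cropped version.
u_full = [1.0, 2.0, 3.0, 4.0]
u_cropped = u_full[:2]

x10 = ncf_mlp_input(p_i, q_j)             # input for Eq. (10)
x11 = ncf_mlp_input(p_i, q_j, u_cropped)  # input for Eq. (11)
```

The input width of the first MLP layer therefore equals the sum of the three vector lengths, so it depends on which sub-model dimension supplies $u_{i}$.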

We train entity and relation embeddings for SKG based on TransE DBLP:conf/nips/BordesUGWY13 and input the trained entity (user) embedding into Equation (9) and Equation (11).

Appendix D Details of extending MED to language model BERT-base

D.1 Dataset and Evaluation Metric

For the experiments extending MED to BERT, we adopt the common GLUE DBLP:conf/iclr/WangSMHLB19 benchmark for evaluation. To be specific, we use the development set of the GLUE benchmark, which covers four task categories: Paraphrase Similarity Matching, Sentiment Classification, Natural Language Inference, and Linguistic Acceptability. For Paraphrase Similarity Matching, we use MRPC DBLP:conf/acl-iwp/DolanB05 , QQP and STS-B DBLP:conf/lrec/ConneauK18 for evaluation. For Sentiment Classification, we use SST-2 DBLP:conf/emnlp/SocherPWCMNP13 . For Natural Language Inference, we use MNLI DBLP:conf/naacl/WilliamsNB18 , QNLI DBLP:conf/emnlp/RajpurkarZLL16 , and RTE for evaluation. In terms of evaluation metrics, we follow previous work DBLP:conf/naacl/DevlinCLT19 ; DBLP:conf/emnlp/SunCGL19 . For MRPC and QQP, we report F1 and accuracy. For STS-B, we use Pearson and Spearman correlation as our metrics. The other tasks use accuracy as the metric. For MNLI, the results of MNLI-m and MNLI-mm are reported separately.

D.2 Baselines

For comparison, we choose Knowledge Distillation (KD) models and Hardware-Aware Transformers DBLP:conf/acl/WangWLCZGH20 (HAT) customized for transformers as baselines. For the KD models, we compare MED with Basic KD (BKD) DBLP:journals/corr/HintonVD15 , Patient KD (PKD) DBLP:conf/emnlp/SunCGL19 , Relational Knowledge Distillation (RKD) DBLP:conf/cvpr/ParkKLC19 , Deep Self-attention Distillation (MiniLM) DBLP:conf/nips/WangW0B0020 , Meta Learning-based KD (MetaDistill) DBLP:conf/acl/ZhouXM22 and Feature Structure Distillation (FSD) DBLP:journals/eswa/JungKNK23 . For the comparability of the results, we choose 4-layer BERT (BERT₄) or 6-layer BERT (BERT₆) as the student model architectures, which guarantees that the number of model parameters (#P(M)) or speedup is comparable. For HAT, we use the same model architecture as our MED for training and show the results of sub-models with three parameter scales.

D.3 Implementation

To implement MED on BERT, for the word embedding layer, all sub-models share the front portion of embedding parameters in the same way as in KGE, and for the transformer layer, all sub-models share the front portion of weight parameters as in HAT DBLP:conf/acl/WangWLCZGH20 . Specifically, assuming that the embedding dimension of the largest BERT model $B_{n}$ is $d_{n}$ , and the embedding dimension of the sub-model $B_{i}$ is $d_{i}$ , for any parameter matrix with the shape $x\times y$ in $B_{n}$ , its front sub-matrix with the shape $\frac{d_{i}}{d_{n}}x\times\frac{d_{i}}{d_{n}}y$ is the parameter matrix at the corresponding position in $B_{i}$ . Finally, it only remains to replace the triple score $s_{(h,r,t)}$ in Equation (1), Equation (2), Equation (3), and Equation (4) with the logit that the classifier outputs for the corresponding category in the classification task.
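The front-portion cropping rule can be sketched on a plain nested-list matrix. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, and the sketch assumes that both $\frac{d_{i}}{d_{n}}x$ and $\frac{d_{i}}{d_{n}}y$ are integers, raising an error otherwise.

```python
def crop_front(matrix, d_i, d_n):
    """Return the front (d_i/d_n * x) by (d_i/d_n * y) sub-matrix.

    matrix: parameter matrix of the largest model, as a list of rows.
    d_i:    embedding dimension of the sub-model.
    d_n:    embedding dimension of the largest model.
    """
    x, y = len(matrix), len(matrix[0])
    if (x * d_i) % d_n or (y * d_i) % d_n:
        raise ValueError("cropped shape is not an integer")
    rows, cols = x * d_i // d_n, y * d_i // d_n
    # Slicing copies the rows; a real implementation would share storage.
    return [row[:cols] for row in matrix[:rows]]

# Toy 4 x 8 matrix whose entries encode their own position.
full = [[10 * r + c for c in range(8)] for r in range(4)]

half = crop_front(full, d_i=2, d_n=4)      # 2 x 4 front block
quarter = crop_front(full, d_i=1, d_n=4)   # 1 x 2 front block
```

Because every sub-matrix starts at the top-left corner, the parameters of a smaller sub-model are contained in those of every larger one; cropping `half` again with `d_i=1, d_n=2` yields the same block as `quarter`.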

When applying MED to BERT, we set $n=4$ , and the 4 sub-models have the following settings: [768, 512, 256, 128] for the embedding dim, [768, 512, 256, 128] for the hidden dim, [12, 12, 6, 6] for the number of heads in the attention modules, and 12 for the number of encoder layers.