TAROT: A Hierarchical Framework with Multitask Co-Pretraining on Semi-Structured Data towards Effective Person-Job Fit

Abstract

Person-job fit is an essential part of online recruitment platforms in serving various downstream applications like Job Search and Candidate Recommendation. Recently, pretrained large language models have further improved effectiveness by leveraging richer textual information in user profiles and job descriptions in addition to user behavior features and job metadata. However, their general domain-oriented design struggles to capture the unique structural information within user profiles and job descriptions, leading to a loss of latent semantic correlations. We propose TAROT, a hierarchical multitask co-pretraining framework, to better utilize structural and semantic information for informative text embeddings. TAROT targets semi-structured text in profiles and jobs, and it is co-pretrained with multi-grained pretraining tasks to constrain the acquired semantic information at each level. Experiments on a real-world LinkedIn dataset show significant performance improvements, supporting its effectiveness in person-job fit tasks.

Index Terms— person-job fit, structured text, multi-task

1 Introduction

Fig. 1: The framework of TAROT. Gray boxes refer to different pretraining tasks at each level.

Reducing unemployment is a permanent theme for labor and government, especially during epidemics [1]. Enhancing recruitment efficiency, for example by improving Person-Job fit accuracy, can significantly lower unemployment rates, reduce expensive recruitment costs and eliminate job seekers’ wasted efforts [2].

Traditional efforts on Person-Job fit tend to utilize features from user behaviors or job metadata, such as collaborative filtering for job recommendation [3, 4]. The rapid development of online recruitment platforms like LinkedIn and Indeed has facilitated the use of deep learning to obtain text embeddings from large-scale user profiles and job descriptions [5, 6]. Recently, large language models (LLMs, e.g., BERT [7] and GPT-3 [8]) have proven to be effective in natural language processing and understanding, and have become a new choice for learning text representations [9, 10, 11]. However, LLMs oriented to general-domain text may fail on Person-Job fit data, where text is organized in structures and carries domain-specific semantics.

The practical scenario raises new challenges in pretraining LLMs for semi-structured data from specific domains. Firstly, pretrained LLMs are often oriented to general domain corpora, which can be of low relevance to, or even contradictory with, domain-specific corpora (e.g., abbreviations), leading to embedding collapse of LLMs without domain-specific pretraining [12, 13]. Secondly, unlike plain text, texts in user profiles and job descriptions are primarily organized in domain-specific hierarchical structures, since people tend to format them to better convey their purposes. Such domain-specific information can help models understand semantics [14, 15, 16, 17, 18], yet it is ignored in current approaches, leaving room for improvement.

To tackle these challenges, we propose a hierarchically designed multitask framework, TAROT, to co-pretrain large language models by incorporating structure information. As Figure 1 shows, some unstructured job descriptions are segmented by LinkedIn services according to pre-defined sections; together with the naturally structured user profiles and the remaining, already structured jobs, they constitute the recruitment data. To match this structure, we design TAROT with a corresponding hierarchy: sentence $\rightarrow$ section $\rightarrow$ individual $\rightarrow$ interaction level from bottom to top. The semantics of sentences are extracted by BERT, and upper-level embeddings are derived by an attention fusion layer over the lower level. Interaction between job and user embeddings is enhanced via a cross-attention layer so that each side can be adaptively adjusted based on the needs of the other side. In addition, hierarchical pretraining tasks are designed to encourage integration of structural information and to constrain the model to learn domain-relevant semantics at each level. We use the output embeddings in Person-Job fit downstream tasks for evaluation. Experimental results on two tasks demonstrate the superiority of TAROT over the baseline methods. The main contributions can be summarized as follows:

• We propose a hierarchically structured framework for representation learning of person-job fit domain-specific text data as a complement to traditional features.
• We propose multi-grained pretraining tasks specifically for the person-job fit area.
• Extensive experiments are conducted to verify the effectiveness of our design and the benefits to downstream tasks.

2 TAROT

2.1 Preliminaries

We denote the set of users as $U=\{u\}$ and jobs as $J=\{j\}$ . Job descriptions $j$ are divided into sections $S^{j}$ : Responsibilities, Qualifications, Requirements, Job Title, Functions, Skills, Benefits and Company; and profile sections $S^{u}$ include Summary, Headline, Education, Position and Skills. Sections consist of sentences like $S^{j}=[s^{j}_{1},\cdots,s^{j}_{k}]$ where $k$ is the number of sentences in $S^{j}$ . The objective of Person-job fit is to predict the matching degree between $u$ and $j$ based on the output embedding of a learned language model.
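To make the notation concrete, the semi-structured input can be pictured as a mapping from section names to sentence lists. The following is only an illustrative sketch: the `Document` class and its method names are hypothetical and not part of TAROT, while the section names are those listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Section names as listed in the text.
JOB_SECTIONS = ["Responsibilities", "Qualifications", "Requirements", "Job Title",
                "Functions", "Skills", "Benefits", "Company"]
PROFILE_SECTIONS = ["Summary", "Headline", "Education", "Position", "Skills"]


@dataclass
class Document:
    """Hypothetical container for a profile or a job: section name -> sentences."""
    doc_id: str
    sections: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, section: str, sentences: List[str], allowed: List[str]) -> None:
        # Reject section names outside the pre-defined list for this document type.
        if section not in allowed:
            raise ValueError(f"unknown section: {section}")
        self.sections[section] = list(sentences)

    def num_sentences(self, section: str) -> int:
        # This is k in the notation S = [s_1, ..., s_k]; 0 if the section is absent.
        return len(self.sections.get(section, []))
```

For example, `Document("j1")` followed by `add("Skills", ["Python", "SQL"], JOB_SECTIONS)` yields a job whose Skills section has $k=2$ sentences.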

2.2 Hierarchical Structured Language Model

2.2.1 Language Model

As a pretraining language model, BERT [7] has demonstrated promising capabilities in natural language processing tasks in recent years. To empower BERT, TAROT continue-pretrains BERT on large-scale corpora from user profiles and job descriptions on LinkedIn. Sentences are fed into TAROT’s language model section-by-section to obtain embeddings.

2.2.2 Attention Fusion Layer

As a hierarchically structured model, it is crucial for TAROT to aggregate information from the current level for the level above. Section-level representations require measuring the importance of different sentences, while individual-level embeddings require distinguishing between sections. Therefore, we adopt the attention-based fusion method [19] to adaptively learn these differences for embeddings at the two levels. Take profile representation learning at the individual level as an example. The embedding sequence of sections is denoted as $E^{u}=[E^{u}_{*}]$ . Mathematically, the attention-based fusion embedding is generated as:

$$\begin{aligned}\tilde{E}^{u}&=\textbf{Pooling}(E^{u}_{*}),\\ E_{u}^{\prime}&=\tilde{E}^{u}+\textbf{Attention}(Q=\tilde{E}^{u},K=E^{u},V=E^{u}).\end{aligned}\tag{1}$$

The pooled embedding $\tilde{E}^{u}$ guides the attention fusion layer to acquire appropriate weights for each section and generate global context-aware individual representations $E_{u}^{\prime}$ for user $u$ .
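The fusion step of Eq. 1 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the actual implementation: it assumes mean pooling, single-head scaled dot-product attention and no learned projection matrices, none of which are specified in the text, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average equal-length vectors (mean pooling is an assumption)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fusion(sections: List[Vector]) -> Vector:
    """E' = pooled + Attention(Q=pooled, K=sections, V=sections), as in Eq. 1."""
    pooled = mean_pool(sections)
    d = len(pooled)
    # One score per section: scaled dot product between the pooled query and the section.
    scores = [sum(q * k for q, k in zip(pooled, sec)) / math.sqrt(d) for sec in sections]
    weights = softmax(scores)
    # Weighted sum of section embeddings, then the residual connection to the pooled vector.
    attended = [sum(w * sec[i] for w, sec in zip(weights, sections)) for i in range(d)]
    return [p + a for p, a in zip(pooled, attended)]
```

The same routine would apply one level lower, with sentence embeddings as input and a section embedding as output.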

2.2.3 Cross-Attention Layer

Empirical evidence suggests that different users are attracted by different sections of the same job description, and the same holds for the recruiter-profile relationship. This suggests that profile and job description representations should not be learned in isolation. Hence, we design the cross-attention layer, where the job-oriented attention takes job embeddings $E^{\prime}_{j}$ as queries on the profile embeddings $E^{\prime}_{u}$ and obtains $A_{j}$ . Similarly, we obtain $A_{u}$ from the user-oriented attention, and concatenate $A_{j}$ and $A_{u}$ as the output. The output is then combined with $E^{\prime}_{j}$ or $E^{\prime}_{u}$ to form the final embedding for the job or the user, respectively.
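One possible reading of this layer is sketched below. Several details are assumptions rather than statements from the text: each side's query is a single vector that attends over the other side's section-level embeddings, attention is single-head scaled dot-product without learned projections, and the final "combination" is a plain concatenation. The names `attend` and `cross_attention` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attend(query: Vector, keys: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over keys (values equal keys)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e / z * key[i] for e, key in zip(exps, keys)) for i in range(d)]


def cross_attention(job_vec: Vector, user_vec: Vector,
                    job_sections: List[Vector],
                    user_sections: List[Vector]) -> Tuple[Vector, Vector]:
    """Return (final job embedding, final user embedding)."""
    a_j = attend(job_vec, user_sections)   # job-oriented attention over the profile
    a_u = attend(user_vec, job_sections)   # user-oriented attention over the job
    interaction = a_j + a_u                # list concatenation of A_j and A_u
    # Assumed combination: concatenate each side's own embedding with the interaction.
    return job_vec + interaction, user_vec + interaction
```

Under this reading, both final embeddings share the same interaction part and differ only in the individual-level part they carry.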

2.3 Hierarchical Pretraining Tasks

2.3.1 Sentence-level: Masked Language Model

Although job descriptions and user profiles are semi-structured data, the sequence of sentences still plays a critical role. Therefore, the classical Masked Language Model (MLM) [7] is adopted so that TAROT places more emphasis on the recruitment-related corpora.
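For reference, the input corruption used by the classical MLM objective can be sketched as follows. The 15% selection rate and the 80/10/10 replacement split are the defaults of the original BERT recipe [7]; the text does not state whether TAROT changes them, so they are assumptions here, and `mask_tokens` is a hypothetical helper.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: List[str], vocab: List[str], mask_prob: float = 0.15,
                seed: Optional[int] = None) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted inputs, labels); labels are None at unselected positions."""
    rng = random.Random(seed)
    inputs: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                    # the model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK)               # 80%: replace with the mask token
            elif r < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)                   # position is not predicted
            inputs.append(tok)
    return inputs, labels
```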

2.3.2 Section-level: Experience Classification

Given a section $S^{u}$ from user profile $u$ , the Experience Classification task is defined as a multi-class classification that utilizes the content of $S^{u}$ to predict its section name from $\{Summary,Headline,Education,Position,Skills\}$ . Technically, the sentences of $S^{u}$ are fed into BERT to get their representations, and the output for the entire section is then taken as the input of a Multi-Layer Perceptron (MLP) to predict the section name label $\tilde{y}_{S^{u}}$ . Experience Classification minimizes the cross-entropy loss between $\tilde{y}_{S^{u}}$ and the real section name label $y_{S^{u}}$ over $n_{Exp}$ samples. For convenience, we write the objective in the generic form of Eq. 2, which the later tasks reuse; e.g., the Experience Classification objective is $L_{Exp}(n_{Exp},\tilde{y}_{S^{u}},y_{S^{u}})$ :

$$L_{task}(n_{task},\tilde{y},y)=-\frac{1}{n_{task}}\sum\left[y\cdot\log\tilde{y}+(1-y)\cdot\log(1-\tilde{y})\right].\tag{2}$$
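Eq. 2 is the standard binary cross-entropy averaged over $n_{task}$ samples, with the leading minus sign and the $1/n_{task}$ factor applying to both terms of the sum. A minimal sketch is given below; the clipping constant `eps` is an added numerical safeguard that is not part of the formula, and the function name is hypothetical.

```python
import math
from typing import Sequence


def binary_cross_entropy(y_pred: Sequence[float], y_true: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples, as in Eq. 2."""
    n = len(y_true)
    total = 0.0
    for p, y in zip(y_pred, y_true):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0); not part of Eq. 2
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n
```

An uninformative prediction of 0.5 for every sample gives a loss of $\log 2$, while perfect predictions give a loss close to zero.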

2.3.3 Individual-level: Attribute Validation

To focus the embeddings on attributes that co-occur on both sides, we design the attribute validation task at the individual level so that key attributes are better incorporated. The attribute validation task leverages embeddings from previous layers to predict attributes of users and jobs. Here Skill is chosen as the key individual information, because in the person-job fit scenario a match between a job and a user profile highly depends on the skills possessed by the user and required by the job. The label is the unified “skill_ids” that are extracted by a LinkedIn service from the skills section. Representations for attribute validation are obtained from the individual-level attention fusion layer in two steps: the skill section is removed from $\tilde{E}^{u}$ during pooling (denoted as $\tilde{E}^{u}_{msk}$ ), and it is also masked when conducting self-attention:

$$E^{\prime}_{u,msk}=\tilde{E}^{u}_{msk}+\textbf{Attention}(Q=\tilde{E}^{u}_{msk},K=E^{u},V=E^{u}).\tag{3}$$

The predicted label $\tilde{y}_{Att}$ is generated through a single-layer MLP on $E^{\prime}_{u,msk}$ . A multi-label one-versus-all loss is used to transform the task into a series of binary classification problems, and the Attribute Validation objective can be denoted as $L_{Att}(n_{Att},\tilde{y}_{Att},y_{Att})$ . A similar objective can be defined for job representations.
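The two masking steps and the one-versus-all loss can be illustrated as follows. This is a sketch under assumptions: it follows the textual description and hides the Skills section from both pooling and attention, it uses mean pooling and projection-free single-head attention, and it omits the single-layer MLP that would map the fused vector to skill probabilities. Function names are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def masked_fusion(sections: Dict[str, Vector], masked: str = "Skills") -> Vector:
    """Attention fusion that hides one section from both pooling and attention."""
    visible = [v for name, v in sections.items() if name != masked]
    d = len(visible[0])
    pooled = [sum(col) / len(visible) for col in zip(*visible)]
    scores = [sum(q * k for q, k in zip(pooled, v)) / math.sqrt(d) for v in visible]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    attended = [sum(e / z * v[i] for e, v in zip(exps, visible)) for i in range(d)]
    return [p + a for p, a in zip(pooled, attended)]


def one_vs_all_loss(probs: List[float], skill_labels: List[int],
                    eps: float = 1e-12) -> float:
    """Mean binary cross-entropy with one binary problem per skill id."""
    total = 0.0
    for p, y in zip(probs, skill_labels):
        p = min(max(p, eps), 1.0 - eps)  # numerical safeguard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(skill_labels)
```

Because the Skills section never enters `masked_fusion`, the model has to infer the skill labels from the remaining sections, which is the point of the task.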

2.3.4 Interaction-level: Application Classification

The Application Classification task is designed to predict whether a user will apply for a job, in order to strengthen interactions between the user profile embedding and the job description embedding. It is worth mentioning that negative samples are randomly generated and account for 3/4 of the dataset. The reason is that the collected application data highly rely on job recommendations, and users only react to recommended jobs that the system already considers suitable. In practice, when a job is recommended, the user can choose to skip, dismiss, save or apply for it. The “skip” action is excluded, while “apply” and “save” are considered positive and “dismiss” is negative. The final output of TAROT is fed into a single-layer MLP to generate the predicted label $\tilde{y}_{App}$ , and the model is trained with the corresponding objective $L_{App}(n_{App},\tilde{y}_{App},y_{App})$ .
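The labeling rule and the random negative sampling can be sketched as below. This is an illustrative sketch only: the text does not say whether observed “dismiss” pairs count toward the 3/4 negative share (here they do), nor how random pairs are drawn (here uniformly from user-job pairs with no recorded action). `build_pairs` and its parameters are hypothetical.

```python
import random
from typing import List, Tuple

# "skip" is intentionally absent: skipped recommendations are excluded.
ACTION_LABEL = {"apply": 1, "save": 1, "dismiss": 0}


def build_pairs(events: List[Tuple[str, str, str]], users: List[str], jobs: List[str],
                neg_fraction: float = 0.75, seed: int = 0) -> List[Tuple[str, str, int]]:
    """Label observed actions, then add random negatives up to neg_fraction."""
    rng = random.Random(seed)
    labeled = [(u, j, ACTION_LABEL[a]) for u, j, a in events if a in ACTION_LABEL]
    observed = {(u, j) for u, j, _ in events}
    n_pos = sum(y for _, _, y in labeled)
    n_neg_observed = len(labeled) - n_pos
    # Number of negatives such that negatives / (positives + negatives) = neg_fraction.
    target_neg = round(n_pos * neg_fraction / (1.0 - neg_fraction))
    need = max(0, target_neg - n_neg_observed)
    # Enumerating all pairs is fine for a toy example; large data would need rejection sampling.
    candidates = [(u, j) for u in users for j in jobs if (u, j) not in observed]
    sampled = rng.sample(candidates, min(need, len(candidates)))
    return labeled + [(u, j, 0) for u, j in sampled]
```

With two positive pairs and one observed dismissal, the sketch adds five random negatives, so that six of the eight resulting pairs are negative.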

2.4 Pretraining and Downstream Evaluation

TAROT is co-pretrained on text from job descriptions and user profiles with the hierarchical tasks, and the overall objective can be formulated as:

$$L=\sum_{*}\lambda_{*}L_{*},\quad\text{where }*\in\{MLM,Exp,Att,App\},\tag{4}$$

where $\lambda_{*}$ is the weighting hyper-parameter for the corresponding pretraining task. We select two downstream tasks for evaluation: job recommendation for users and user (candidate) recommendation for jobs, i.e., for recruiters.
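As a small worked example of Eq. 4, the overall objective is a weighted sum of the four task losses. The loss values and weights below are made up for illustration; the text does not report the $\lambda_{*}$ values that were used, and `total_loss` is a hypothetical helper.

```python
from typing import Dict

TASKS = ("MLM", "Exp", "Att", "App")


def total_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """L = sum over tasks of lambda_task * L_task, as in Eq. 4."""
    return sum(weights[t] * losses[t] for t in TASKS)


# Illustrative numbers only (not from the paper).
example_losses = {"MLM": 1.0, "Exp": 2.0, "Att": 3.0, "App": 4.0}
example_weights = {"MLM": 1.0, "Exp": 0.5, "Att": 0.5, "App": 0.25}
example_total = total_loss(example_losses, example_weights)  # 1.0 + 1.0 + 1.5 + 1.0
```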

3 Experiments

Task	Job Recommendation					Candidate Recommendation
Models	AUC	Recall@3	Precision@3	NDCG@3	MRR	AUC	Recall@3	Precision@3	NDCG@3	MRR
PJFNN	-	-	-	-	-	-	-	-	-	-
PJFNN+BERT	+2.538%	+2.280%	+0.899%	+2.027%	+0.582%	+0.334%	+1.464%	+1.569%	+1.768%	+0.735%
PJFNN+TAROT	+4.477%	+3.941%	+4.780%	+3.896%	+3.831%	+6.977%	+8.176%	+9.765%	+13.494%	+9.007%
BPJFNN	+4.658%	-0.732%	+2.130%	-1.211%	-0.436%	+2.765%	+2.990%	-0.959%	+2.475%	+0.827%
BPJFNN+BERT	+5.057%	+0.929%	+3.739%	-0.211%	+1.334%	+3.229%	+4.637%	+1.831%	+4.537%	+2.987%
BPJFNN+TAROT	+6.163%	+4.110%	+6.152%	+1.003%	+3.177%	+7.385%	+10.189%	+8.718%	+14.555%	+7.721%
APJFNN	+9.679%	+3.941%	+6.294%	+0.684%	+3.637%	+6.198%	+7.444%	+12.119%	+11.314%	+6.985%
APJFNN+BERT	+10.694%	+4.899%	+7.856%	+2.237%	+4.680%	+6.866%	+9.457%	+15.606%	+12.080%	+7.675%
APJFNN+TAROT	+11.891%	+6.278%	+10.554%	+4.396%	+5.941%	+8.295%	+14.216%	+18.309%	+17.030%	+10.386%

Table 1: Performance comparison on two downstream tasks. Results are relative improvements compared to PJFNN.

3.1 Experiment Settings

Task	Job Recommendation
Models	AUC	HR@1	NDCG@5	NDCG@25	MRR
w/o MLM	-4.9%	-11.0%	-8.3%	-10.0%	-8.7%
w/o Exp	-5.1%	-6.8%	-4.5%	-8.0%	-8.7%
w/o Att	-4.6%	-2.3%	-4.9%	-2.5%	+0.4%
w/o App	-9.5%	-13.7%	-11.3%	-9.3%	-10.5%

Table 2: Multitask ablation study with TAROT as baseline.

Task	Job Recommendation			Candidate Recommendation
Models	AUC	NDCG@5	MRR	AUC	NDCG@5	MRR
OF	-	-	-	-	-	-
OF+BERT	+0.3%	-0.7%	-1.9%	+1.8%	+0.6%	+0.6%
OF+TAROT	+6.0%	+12.9%	+6.0%	+5.5%	+4.3%	+4.5%

Table 3: Improvement to online service features. “OF” refers to online features used in LinkedIn system.

Dataset The data for pre-training is collected from user activity records of LinkedIn, with user profiles and job descriptions anonymized for security. We filter out incomplete user profiles, and the training data contains over 800k job application records, covering 193k users and 331k jobs. There are two downstream tasks.

For the job recommendation task, the data is composed of job and user profile pairs. “Skip” and “Dismiss” actions are labeled negative, and “Save” and “Apply” are labeled positive. The dataset contains 31k users and 54k jobs, with 150k samples. The candidate recommendation task recommends users to recruiters according to the jobs they have posted. When a user is recommended, the recruiter can contact or skip the candidate. The dataset contains 133k users and 19k jobs, with 150k samples. To prevent data leakage, the users in these two downstream datasets are selected from a different group than the users in the pretraining data.

Compared Methods We compare TAROT with variants of PJFNNs to examine the benefits of exploiting semi-structured text. The compared methods are divided into three types: (1) PJFNNs [5, 6], a series of models that have proven effective for person-job fit; (2) PJFNNs + BERT, which use BERT as a plugin for semantic embeddings; and (3) PJFNNs + TAROT, which use TAROT embeddings in the same way as PJFNNs + BERT.

Implementation Details We choose the small BERT model [20] with 512 hidden neurons. Adam is utilized as the optimizer and the learning rate is $10^{-4}$ . We use a grid search strategy to find the best hyper-parameters. AUC, top-3 recall (Recall@3), top-3 precision (Precision@3), Normalized Discounted Cumulative Gain (NDCG@3) and Mean Reciprocal Rank (MRR) are used to evaluate model performance. We take PJFNN as the baseline, and all other results are reported as relative improvements against it.
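For clarity, the ranking metrics and the relative-improvement convention can be sketched as follows. The sketch assumes binary relevance labels listed in ranked order and the common $\log_{2}$ discount for NDCG; the exact evaluation code is not given in the text, and the function names are hypothetical. MRR is the mean of `reciprocal_rank` over all ranked lists.

```python
import math
from typing import List


def precision_at_k(rels: List[int], k: int) -> float:
    """Fraction of the top-k items that are relevant."""
    return sum(rels[:k]) / k


def recall_at_k(rels: List[int], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    total = sum(rels)
    return sum(rels[:k]) / total if total else 0.0


def ndcg_at_k(rels: List[int], k: int) -> float:
    """DCG of the top k divided by the DCG of an ideal ordering."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg else 0.0


def reciprocal_rank(rels: List[int]) -> float:
    """1 / rank of the first relevant item, or 0 if there is none."""
    for i, r in enumerate(rels):
        if r:
            return 1.0 / (i + 1)
    return 0.0


def relative_improvement(value: float, baseline: float) -> float:
    """Relative change against the baseline, in percent."""
    return 100.0 * (value - baseline) / baseline
```

For the ranked list `[0, 1, 1, 0, 1]`, Precision@3 and Recall@3 are both 2/3 and the reciprocal rank is 0.5.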

3.2 Downstream Performance Comparison

The results are organized in Table 1. Compared to PJFNNs, additional BERT embeddings provide improvements on most metrics, showing the benefit of incorporating semantic information from pretrained LLMs. We also observe that TAROT significantly enhances the recommendation performance on both tasks, demonstrating that embeddings from the hierarchical co-pretraining framework are more expressive and informative than those from the pretrained semantic plugin BERT. Besides, TAROT not only achieves remarkable performance on the job recommendation task, which is highly correlated with the Application Classification pretraining task, but also obtains impressive gains on the candidate recommendation task. Note that headhunters are not explicitly modeled in our framework: a headhunter posts numerous jobs and therefore does not correspond to any single individual-level representation. However, the candidate recommendation task is headhunter-oriented, which implies that even without a corresponding pretraining task, TAROT embeddings can still help models generalize to unseen Person-Job fit downstream tasks.

3.3 Ablation Study

To evaluate the design of multitask training in our framework, we conduct ablation studies by removing individual pretraining tasks and observing the performance on the downstream job recommendation task, for which the full set of pretraining tasks is relevant. Here we add the top-1 hit rate (HR@1) metric, and the results are shown in Table 2. “w/o App”, “w/o Att”, “w/o Exp” and “w/o MLM” refer to pretraining without the corresponding task. Removing any single task degrades performance on nearly all metrics (the only exception is a +0.4% change in MRR for w/o Att), which indicates that Experience Classification and Attribute Validation contribute to the downstream task. The w/o App variant shows the largest drop on four of the five metrics, indicating that Application Classification plays the most critical role because it further strengthens information interactions between job descriptions and user profiles. In summary, the results support the effectiveness of our multitask co-pretraining framework.

3.4 Improvement to Online Service Features

We also combine TAROT embeddings with features used in the LinkedIn online service. Table 3 shows that TAROT embeddings provide additional gains over the currently used features and are more effective than BERT embeddings, which implies their value in practice. Notably, individual-level embeddings can be stored to speed up inference in online products.

4 Conclusion

In this paper, we propose TAROT to provide expressive embeddings for person-job fit applications. To fully leverage the text and interaction information from job descriptions and user profiles, we design a hierarchical multitask co-pretraining framework for a better understanding of their semantic information and correlations. To evaluate its effectiveness, we conduct comprehensive experiments on real LinkedIn data with several baselines. The experimental results show that our framework can significantly improve downstream task performance and provides additional gains over the features used in the LinkedIn online service.

References

[1] Richard Layard, Stephen Nickell, and Richard Jackman, “The unemployment crisis,” 1994.
[2] Society For Human Resource Management, “Human capital benchmarking report,” 2016.
[3] Yingya Zhang, Cheng Yang, and Zhixiang Niu, “A research of job recommendation system based on collaborative filtering,” in 2014 Seventh International Symposium on Computational Intelligence and Design, 2014, vol. 1, pp. 533–538.
[4] Yao Lu, Sandy Ingram, and Denis Gillet, “A recommender system for job seeking and recruiting website,” May 2013, pp. 963–966.
[5] Chuan Qin, Hengshu Zhu, Tong Xu, Chen Zhu, Liang Jiang, Enhong Chen, and Hui Xiong, “Enhancing person-job fit for talent recruitment: An ability-aware neural network approach,” June 2018, pp. 25–34.
[6] Shuqing Bian, Wayne Xin Zhao, Yang Song, Tao Zhang, and Ji-Rong Wen, “Domain adaptation for person-job fit with transferable deep global match network,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, Nov. 2019, pp. 4810–4820, Association for Computational Linguistics.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, June 2019, pp. 4171–4186, Association for Computational Linguistics.
[8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[9] Shuqing Bian, Xu Chen, Wayne Xin Zhao, Kun Zhou, Yupeng Hou, Yang Song, Tao Zhang, and Ji-Rong Wen, “Learning to match jobs with resumes from sparse interaction data using multi-view co-teaching network,” in Proceedings of the 29th ACM International Conference on Information and Knowledge Management, New York, NY, USA, 2020, CIKM ’20, pp. 65–74, Association for Computing Machinery.
[10] Jiayi Liao, Xu Chen, and Lun Du, “Concept understanding in large language models: An empirical study,” 2023.
[11] Jiayi Liao, Xu Chen, Qiang Fu, Lun Du, Xiangnan He, Xiang Wang, Shi Han, and Dongmei Zhang, “Text-to-image generation for abstract concepts,” 2023.
[12] Nils Reimers and Iryna Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.
[13] Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li, “On the sentence embeddings from pre-trained language models,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 9119–9130.
[14] Lun Du, Fei Gao, Xu Chen, Ran Jia, Junshan Wang, Jiang Zhang, Shi Han, and Dongmei Zhang, “Tabularnet: A neural network architecture for understanding semantic structures of tabular data,” in KDD, 2021, pp. 322–331.
[15] Lun Du, Xu Chen, Fei Gao, Qiang Fu, Kunqing Xie, Shi Han, and Dongmei Zhang, “Understanding and improvement of adversarial training for network embedding from an optimization perspective,” in WSDM, 2022, pp. 230–240.
[16] Xu Chen, Yuanxing Zhang, Lun Du, Zheng Fang, Yi Ren, Kaigui Bian, and Kunqing Xie, “Tssrgcn: Temporal spectral spatial retrieval graph convolutional network for traffic flow forecasting,” in 2020 ICDM. IEEE, 2020, pp. 954–959.
[17] Xu Chen, Junshan Wang, and Kunqing Xie, “Trafficstream: A streaming traffic flow forecasting framework based on graph neural networks and continual learning,” in IJCAI, 2021.
[18] Xu Chen, Qiu Qiu, Changshan Li, and Kunqing Xie, “Graphad: A graph neural network for entity-wise multivariate time-series anomaly detection,” in ACM SIGIR, 2022, pp. 2297–2302.
[19] Zihang Dai, Guokun Lai, Yiming Yang, and Quoc Le, “Funnel-transformer: Filtering out sequential redundancy for efficient language processing,” June 2020.
[20] Vincent Micheli, Martin d’Hoffschmidt, and François Fleuret, “On the importance of pre-training data volume for compact language models,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, Nov. 2020, pp. 7853–7858, Association for Computational Linguistics.