A Comprehensive Survey of Artificial Intelligence Techniques for Talent Analytics

Chuan Qin, Le Zhang, Yihang Cheng, Rui Zha, Dazhong Shen,
Qi Zhang, Xi Chen, Ying Sun, Chen Zhu,
Hengshu Zhu*, Hui Xiong

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. C. Qin, Y. Cheng, C. Zhu, and H. Zhu are with the Career Science Lab, BOSS Zhipin, Beijing, China. E-mail: [email protected], [email protected], [email protected], [email protected]. L. Zhang is with the Business Intelligence Lab, Baidu Inc., Beijing, China. E-mail: [email protected]. R. Zha and X. Chen are with the University of Science and Technology of China, Anhui, China. E-mail: [email protected], [email protected]. D. Shen and Q. Zhang are with the Shanghai Artificial Intelligence Laboratory. E-mail: [email protected], [email protected]. Y. Sun and H. Xiong are with the Hong Kong University of Science and Technology (Guangzhou), China. E-mail: [email protected], [email protected]. H. Zhu and H. Xiong are the corresponding authors.

Abstract

In today’s competitive and fast-evolving business environment, it is critical for organizations to rethink how to make talent-related decisions in a quantitative manner. Indeed, the recent development of Big Data and Artificial Intelligence (AI) techniques has revolutionized human resource management. The availability of large-scale talent and management-related data provides unparalleled opportunities for business leaders to comprehend organizational behaviors and gain tangible knowledge from a data science perspective, which in turn delivers intelligence for real-time decision-making and effective talent management in their organizations. In the last decade, talent analytics has emerged as a promising field in applied data science for human resource management, garnering significant attention from AI communities and inspiring numerous research efforts. To this end, we present an up-to-date and comprehensive survey of AI technologies used for talent analytics in the field of human resource management. Specifically, we first provide the background knowledge of talent analytics and categorize various pertinent data. Subsequently, we offer a comprehensive taxonomy of relevant research efforts, categorized into three application-driven scenarios at different levels: talent management, organization management, and labor market analysis. Finally, we summarize the open challenges and potential prospects for future research directions in the domain of AI-driven talent analytics.

Index Terms:

Artificial intelligence, talent analytics, talent management, organization management, labor market analysis

1 Introduction

In a world of volatility, uncertainty, complexity, and ambiguity (VUCA), talents are precious assets that play an important role in business success. To cope with the fast-evolving business environment and maintain a competitive edge, it is critical for organizations to rethink how to make talent-related decisions in a quantitative manner. In the era of big data, the availability of large-scale talent data provides unparalleled opportunities for business leaders to understand the rules of talent and management, which in turn delivers intelligence for effective decision-making and management in their organizations [374, 361]. Along this line, as an emerging applied data science direction in human resource management, talent analytics has attracted wide attention from both academia and industry. Specifically, talent analytics, also known as workforce analysis or people analytics, focuses on leveraging data science technologies to analyze extensive sets of talent-related data, empowering organizations with informed decision-making capabilities that enhance their organizational and operational effectiveness [206]. In practice, talent analytics plays a pivotal role in strategic human resource management (HRM), encompassing diverse applications such as talent acquisition, development, and retention, as well as the examination of organizational behaviors and external labor market dynamics. Generally, the research directions of talent analytics can be divided into three categories, as illustrated in Figure 1: talent management, organization management, and labor market analysis.

Figure 1: Graphical abstract of this survey from data to the proposed methods.

To be specific, first, talent management is a continuous strategic process of attracting and hiring high-potential employees, training their skills, motivating them to improve their performance, and retaining them to sustain organizational competitiveness. In this scenario, talent analytics primarily focuses on individual-level analysis. For instance, it can help human resources managers find the right talents for different jobs in a practical way [334, 140, 444], and it can predict employee performance or turnover [6, 233]. Second, organization management is the art of fostering collaboration among talents and guiding organizations toward success. In this scenario, talent analytics can diagnose the health of an organization and measure organizational performance by leveraging various kinds of relationship information between talents or organizations, such as organizational structure, communication patterns, and project collaborations [446, 114]. It can also assist the organization in effectively structuring and optimizing teams [414, 13]. Third, talent analytics can be applied from an external and macro perspective, i.e., to the labor market analysis scenario, which is crucial for devising talent and organizational strategies. For instance, by analyzing talent demands within the labor market, managers can effectively craft recruitment strategies [484, 467].

Historically, talent analytics was proposed within the conception of HRM around the 1920s [191]. In the early stages of HRM before 1970, talent analytics was usually manual [280]. In this stage, human resource systems (HRS) or human resource management systems (HRMS) were the main method for talent analytics [51]. These systems took various forms, such as commitment systems and control systems [25], and the assessment of talent performance with them was conducted by human resource managers [138]. At the same time, talent assessment-related theories took shape, such as human capital theory and resource-based theory [191]. With the advent of mechanical automation, human resource information systems (HRIS) arose around the 1940s. Before 1970, however, HRIS was based on sorting and tabulating equipment; its main function was keeping employee records automatically, and there was no computer support [111]. Nevertheless, from this stage on, talent analytics could be supported by aggregated information. As information systems developed in response to the growing volume of data generated during management, computer-enhanced HRIS became widely adopted in the 1970s [59]. In this stage, HRIS was a combination of databases, computer applications, software, and hardware used to record, manage, and operate on human resource data. This trend is supported by a survey reporting that 60% of Fortune 500 companies use HRIS to support daily HRM operations [36]. In this stage, people analytics, which refers to a novel, quantitative, evidence-based, and data-driven approach to managing the workforce, was also proposed to raise the efficiency of core human resource functions such as recruiting [142]. Standard statistical methods such as correlation analysis and simple regression were adopted [229]; this approach is generally called descriptive analytics. At the same time, owing to the rich functions of HRIS, many evaluation studies on specific functions of the software were carried out, covering organizational performance, turnover, and so on [126, 21].

With the development of artificial intelligence (AI) algorithms, advanced regression techniques, data mining, text mining, web mining, and forecasting methods came into use in talent analytics around the 2010s [117]; these approaches are generally called predictive analytics. Recent literature highlights the importance of more advanced analytical methods and emerging technologies in talent analytics. For instance, Dahlbom et al. [94] emphasise that “new types of data and different algorithms used in AI and machine learning solutions utilized in HRA [Human Resource Analytics]” (p. 123) will transform the field of people analytics. Two main developments facilitate this transformation. On the one hand, talent analytics is facing digital disruption, which enables the availability of large-scale relevant data. For instance, Indeed, a world-renowned job search site, had 11.3 million active jobs as of January 2022 [61]. Meanwhile, LinkedIn, the largest online professional network, had 774 million members from around 200 countries as of March 2022 [248], building up a wealth of labor market data. Moreover, numerous enterprises are setting up their Digital Human Resource Management Systems (Digital HRMS), enabling the collection, storage, and processing of a huge amount of talent and organizational information in a digital environment [393]. On the other hand, with the advent of talent-related big data, advanced AI techniques have rapidly revolutionized a series of research efforts and practices in this field, which in turn delivers intelligence for decision-making and management in organizations. In this stage, deep learning methods have enabled a new paradigm in person-job fit [334, 485, 46, 444] and person-organization fit [381, 383], so as to achieve efficient and accurate talent selection and development. Text mining methods have been adopted in employer brand analysis based on large-scale labor market data [246, 245], which enables forward-looking strategic planning for the business. At the same time, several high-tech companies are gradually incorporating AI technologies into their HRMS. As an illustration, IBM leverages AI technology to achieve a remarkable 95 percent accuracy in predicting employees who are considering leaving their positions, which saved IBM $300 million in retention costs [345]. With the strong automation capabilities of large language models (LLMs), autonomous people analytics has emerged in the 2020s [315].

Accordingly, AI in talent analytics in this paper includes supervised learning, unsupervised learning, deep learning, reinforcement learning, knowledge representation, natural language processing, and so on [43]. These techniques build different business capabilities in people analytics: automation of structured (or semi-structured) work processes, engagement with employees and managers, decision-making through extensive analysis of a large amount of data, and creation of novel outcomes [43]. This survey attempts to provide a comprehensive review of the rapidly evolving AI techniques for talent analytics. Based on our investigation, we first provide a detailed taxonomy of relevant data, laying a data foundation for leveraging AI techniques to better understand talents, organizations, and the corresponding management. Generally, talents’ behaviors are reflected at three levels: the individual level, the organizational level, and the market level. Accordingly, we review the research efforts on AI techniques for talent analytics from three corresponding aspects: talent management, organization management, and labor market analysis. Finally, we identify challenges for future AI-based talent analytics and suggest potential research directions.

Moreover, to help readers learn more effectively, we highlight the systematic resources provided in this survey as follows:

• Table I summarizes the data for talent analytics.
• Table III summarizes the recent AI-based talent analytics efforts in the talent management scenario.
• Table IV summarizes the recent AI-based talent analytics efforts in the organization management scenario.
• Table V summarizes the recent AI-based talent analytics efforts in the labor market analysis scenario.

2 Data for Talent Analytics

Nowadays, as enterprises undergo an accelerated digital transformation, a large amount of talent analytics-related data has been accumulated. In this section, we will introduce the data collected across various scenarios, providing readers with a foundational understanding of the related research data and the motivation for model design. Generally, the data can be divided into internal data, which are collected from the internal enterprise management system, and external data, which are collected from the external labor market.

2.1 Internal Data

Based on the described objects, internal data can be broadly divided into three categories: recruitment data, employee data, and organizational data.

2.1.1 Recruitment Data

Recruitment data in pre-employment mainly includes the following types:

Resume: A resume or Curriculum Vitae (CV) is a document that outlines a person’s background, skills, and accomplishments. It plays a vital role in the recruitment process, as it facilitates talent screening and assessment [334, 363], and it is an important tool for job seekers to showcase their qualifications and suitability for a position. Recently, a large amount of resume data, in either Word or PDF format, has accumulated with the development of online recruitment. As shown in Figure 3, a resume typically comprises structured information such as gender, age, and education, as well as semi-structured information like educational experience, work experience, and project experience. Accordingly, several resume parsing techniques have been developed to extract this information [74, 75, 443]. On this basis, substantial efforts have been devoted to talent analytics with resume data from different perspectives. For instance, Yao et al. [442] introduced a keyphrase extraction approach to explore job seekers’ skills in resumes, and Pena et al. [323] used image information in resume data to improve screening performance and explore fairness issues. Moreover, several studies have proposed leveraging text mining techniques to determine the matching degree between jobs and job seekers based on their resumes [334, 485, 46]. In addition, resumes also encompass the career trajectories of job seekers. As illustrated on the right side of Figure 3, the candidate’s profile showcases three job experiences, including a job change at Microsoft and work experience at Google. In this vein, Zhang et al. introduced the ResumeVis system to visualize individual career trajectories and mobility across organizations [460]. Researchers have further analyzed the sequential patterns of career trajectories and proposed personalized career development recommendations [289, 465, 400].

Job Posting: A job posting is an advertisement for a vacant position that provides job seekers with information on the job description and requirements. The posting offers applicants a clear understanding of what the position is responsible for and what qualifications are necessary. Recently, the proliferation of online recruitment services has made it increasingly common to publish job postings as web pages. Figure 3 illustrates a typical job posting that comprises structured information, such as the salary range and education requirements, as well as semi-structured content that includes descriptions of job duties and required abilities. Nevertheless, it is still difficult for Human Resource (HR) experts to handle such a large corpus of data manually. To this end, researchers have attempted to reduce the dependence on manual labor by applying neural network-based techniques, particularly NLP, to voluminous job postings. As mentioned before, considerable effort has been devoted to Person-Job Fit [334, 335, 46, 485, 263], which aims to match job postings with suitable resumes. Moreover, Shen et al. leveraged a latent variable model to jointly model the job description, candidate resume, and interview assessment, which can further benefit several downstream applications such as person-job fit and interview question recommendation [364, 363]. To reduce the expense of manual screening, researchers have also extracted job entities from postings and generated interview questions automatically [336, 365, 333, 239]. Apart from these in-firm applications, some studies provide comprehensive insights into the global labor market. For instance, researchers have proposed several data-driven methods for salary analysis across different companies and positions [210, 82, 288]. Zhang et al. utilized large-scale job postings from one of the largest Chinese online recruitment websites to forecast fine-grained talent demand in the recruitment market [467]. Moreover, some studies aim to measure the popularity of job skills and forecast their evolving trends [426, 422]. Along this line, Sun et al. further focused on measuring the value of job skills based on massive job postings, contributing to the quantitative assessment of job skills [382].

Interview-related Data: Interview-related data is typically collected during the interview process and serves to evaluate applicants’ overall qualifications for the position they are applying for. In general, an interview can be conducted either in person or through video, resulting in textual or video-based assessments, respectively. Both kinds of data enable comprehensive evaluations of candidates and facilitate the integration of AI in HR. To address the subjectivity of traditional interviews, Shen et al. utilized a latent variable model to explore the relationship among job descriptions, candidate resumes, and textual interview assessments [364, 363]. The results provide an interpretable understanding of job interview assessments. Textual interview assessments within a company are usually private and sensitive, whereas video assessments have drawn more attention. For instance, several studies extract multimodal features from videos for the automatic analysis of job interviews [78, 80]. In addition, Hemamou et al. proposed a hierarchical attention model to predict the employability of candidates using multimodal information, including text, audio, and video [172]. Along this line, Chen et al. leveraged a hierarchical reasoning graph neural network to automatically score candidate competencies using textual features in asynchronous video interviews [77].

2.1.2 Employee Data

Regarding the development of employees within a company, a significant amount of employee data has been accumulated, including training records and individual work outcomes. An overview of employee-related data is shown in Figure 4.

Employee Profiles: Employee profiles typically describe an individual in terms of two main aspects: demographic characteristics and individual work outcomes. Specifically, the former includes characteristics such as age, gender, and education level [322], which can be used to enhance employee representations and benefit various downstream analyses, such as career mobility prediction [337, 477] and performance forecasting [271]. In addition to these static variables, individual work outcomes depict dynamic career development along different dimensions, such as performance appraisals, promotions, and turnover records. In particular, a performance appraisal is a systematic evaluation of an employee’s job performance and productivity that is typically conducted by line managers. Promotion and turnover records, in turn, show employee movements within and across companies, respectively. All of this information contributes to further insights into employee dynamics. For instance, researchers have leveraged performance appraisals to identify high-potential talents within a company [447]. Li et al. utilized static profiles, performance appraisals, and reporting lines of employees to model career development within a company, focusing on turnover and career progression [233]. Sun et al. proposed to capture the dynamic nature of person-organization fit based on individual profiles, reporting lines, and communication records [381]. To investigate the contagious effect of turnover, researchers have utilized both employee profiles and turnover data [387, 386]. Furthermore, Hang et al. leveraged five kinds of standardized data, including employee turnover records, to predict the turnover probability and period [165].

Training Records: Employee training is a program designed to improve the performance of employees by equipping them with specific skills. Ongoing employee training has proven to be crucial in attracting and retaining top talent [351]. Typically, a training record describes the learning path of an employee, which is a sequence of different skills. Based on these training records, considerable effort has been devoted to exploring the learning patterns of employees. For instance, Wang et al. utilized both learning records and skill profiles of employees from a high-tech company in China to develop a personalized online course recommendation system [402, 401]. Along this line, Srivastava et al. collected employees’ training and work history from a large multinational IT organization to provide personalized next-training recommendations [377]. In addition, some researchers have provided insights into employee competency assessment [234, 421, 252, 177]. For instance, multi-dimensional features, including learning and training dimensions, were collected from a Chinese state-owned enterprise to provide competency assessment for employees [252].

2.1.3 Organizational Data

An organizational structure is a system that outlines how activities are directed toward the achievement of organizational aims [330], and it plays an important role in decision-making and knowledge management. An organization is commonly represented as a hierarchical tree structure, which can take on diverse forms, such as matrix, flat, and network structures. Figure 5 shows several common types of organizational structures. Typically, existing studies explore these complex structures from various dimensions, such as reporting lines and in-firm social networks.

Reporting lines are generally the most representative aspect of an organizational structure, delineating how authority and responsibility are allocated in an organization. In this regard, Sun et al. developed an organization structure-aware convolutional neural network to hierarchically extract compatibility features for measuring person-organization fit and its impact on talent management [381, 383]. Nevertheless, due to privacy concerns, mainstream studies utilize in-firm social networks for human resource management. In general, an in-firm social network can be formed from email or Instant Messaging (IM) records among employees. For example, text-based communication has been used with several machine learning classifiers to identify group mood [213]. Besides, Cao et al. leveraged the lasso regression model to explore team viability using text conversations of online teams [62]. In addition to social networks, researchers have also taken other information into account. For example, Ye et al. utilized both email communication and a high-potential talent list to identify employees with high potential [447]. Along this line, Teng et al. further utilized datasets from three sources for organizational turnover prediction, including profile and turnover, social network, and job levels [386].
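To make the idea of forming an in-firm social network from communication records concrete, the following minimal sketch builds an undirected, weighted network from (sender, recipient) pairs. It is an illustration only and is not taken from any of the cited studies; the function names and the choice of an undirected graph weighted by message counts are assumptions.

```python
from collections import Counter, defaultdict


def build_network(messages):
    """Aggregate (sender, recipient) pairs into undirected weighted edges.

    The edge weight is the number of messages exchanged in either direction.
    Self-addressed messages are ignored.
    """
    edge_weights = Counter()
    for sender, recipient in messages:
        if sender == recipient:
            continue
        # Sort the endpoints so that A->B and B->A map to the same edge.
        edge = tuple(sorted((sender, recipient)))
        edge_weights[edge] += 1
    return edge_weights


def degree(edge_weights):
    """Return the number of distinct contacts for each person in the network."""
    neighbors = defaultdict(set)
    for a, b in edge_weights:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {person: len(peers) for person, peers in neighbors.items()}


if __name__ == "__main__":
    # Hypothetical, anonymized message log.
    log = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("cat", "cat")]
    network = build_network(log)
    print(dict(network))
    print(degree(network))
```

Features such as edge weights and degrees are one simple way to summarize communication patterns; real studies typically combine them with richer structural and textual signals.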

TABLE I: Collected papers related to data for talent analytics.

Categories	Data	Reference
Internal: Recruitment	Resume	[442, 323, 460, 289, 465, 400]
Internal: Recruitment	Job Posting	[334, 46, 485, 467, 426, 422, 382]
Internal: Recruitment	Interview-related	[364, 363, 78, 80, 172, 77]
Internal: Employee	Employee Profiles	[447, 233, 381, 387, 386, 165, 465]
Internal: Employee	Training Records	[402, 401, 377, 234, 421, 252, 177]
Internal: Organization	Reporting Lines	[381, 383]
Internal: Organization	In-firm Social Network	[213, 62, 447, 386]
External	Social Media	[188, 376]
External	Job Search Websites	[246, 245, 31, 357, 318]

2.2 External Data

Apart from the aforementioned internal data, external sources also contribute to a comprehensive understanding of the labor market. These sources can be broadly classified into two categories: social media platforms and job search websites.

Social Media: Widely used social media sources that contribute to a comprehensive understanding of the labor market include Twitter (https://twitter.com), Facebook (https://www.facebook.com), and news reports. With the help of NLP and topic modeling techniques [49], numerous studies have been carried out to explore the semantic information in this corpus. For example, more than 60,000 tweets related to nine energy companies were collected to analyze the sentiment expressed on Twitter [188]. To gain further insight into the impact of public opinion, Spears et al. [376] collected earnings reports and news articles spanning eight years from four companies. The results indicate that companies may face a decline in valuation when they receive negative publicity.

Job Search Websites: Recent years have witnessed the rapid growth of job search websites, such as Indeed (https://www.indeed.com), LinkedIn (https://www.linkedin.com), and Glassdoor (https://www.glassdoor.com). Specifically, Indeed and Glassdoor allow users to comment on a company, providing an overall understanding of the employer brand. For instance, Lin et al. [246, 245] collaboratively modeled both textual information (i.e., reviews) and numerical information (i.e., salaries and ratings) to learn latent structural patterns of employer brands. In addition, Bajpai [31] leveraged data from Glassdoor to perform aspect-level sentiment analysis. Along this line, large-scale reviews of Fortune 500 companies were collected to identify topics that matter to employees [357]. In contrast, LinkedIn provides a wide range of business services, including job listings, professional profile creation, and career development services, with personal profiles being the most analyzed, as they describe users’ employment history. For example, Park et al. [318] used LinkedIn’s employment history data from more than 500 million users over 25 years to construct a labor flow network of over 4 million firms worldwide, demonstrating a strong association between the influx of educated workers and financial performance in detected geo-industrial clusters.

Furthermore, several third-party business investigation platforms offer detailed information about companies and their board members’ relationships, such as Crunchbase (https://www.crunchbase.com), Owler (https://www.owler.com), Tianyancha (https://www.tianyancha.com), and Aiqicha (https://www.aiqicha.com). These details can be viewed as complementary to the information from job search websites. Building upon this foundation, one can gain deeper insights into the aligned companies and conduct more relevant research, such as analyzing cooperative and competitive relationships [96] and providing investment target recommendations [81].

2.3 Data Processing

2.3.1 Data Collection

The source of recruitment data can be categorized into two types: internal data and external data. Correspondingly, data collection methodologies align with these source categories.

Internal Data. The current business environment is typically dependent on data systems [219]. Internal data are collected from internal enterprise management systems, such as enterprise resource planning (ERP) systems [362], customer relationship management (CRM) systems [127], and applicant tracking systems (ATS) [300]. ERP systems are comprehensive business management tools integrating various functions such as finance, sales, materials management, HR, production planning, and supply chain. CRM systems facilitate customer interaction and communication, encompassing customer information management, sales opportunities, and customer service. An ATS is computer software that human resource departments use to process the overwhelming number of applications they receive for job openings. These systems store recruitment, employee, and organizational data in an orderly manner, facilitating efficient collection and processing.

External Data. As online services rapidly evolve, a growing number of individuals are turning to social media and job search websites to exchange job-related information and explore employment opportunities. These interactive platforms host a vast array of talent information due to their extensive user base. Additionally, third-party business investigation platforms provide detailed insights into companies and the relationships of their board members. Data acquisition methods on these platforms vary. Third-party data collection websites typically aggregate data from their participating members. Meanwhile, web crawlers [215] can extract rich information from website pages, provided legal and regulatory compliance is ensured. Moreover, many job search websites retain substantial amounts of user data that is not publicly disclosed. Typically, this data can be utilized for scientific research purposes following encryption and other privacy safeguards.

2.3.2 Data Preprocessing

After collecting a large amount of recruitment data, it is essential to preprocess the data for downstream applications, especially removing noisy, redundant, irrelevant, and potentially toxic data. In this part, we review the detailed data preprocessing strategies to improve the quality of the collected data according to various data types.

Structured Data. Structured data, in simple terms, is data stored in a database, such as that of an ERP system [120], in a standardized format for efficient access by software and humans alike. Information about employees and companies is collected into such databases in an orderly manner. On some data collection websites, a large amount of user information is also stored strictly in databases, such as records of interactions between users and the platform, including clicking and browsing. Since these data are already stored in a standardized format, filtering and integration can generally be achieved by joining different tables and setting filtering conditions on them [32].
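As a minimal sketch of joining tables and applying filter conditions, the example below uses Python's built-in sqlite3 module with an in-memory database. The table names, column names, and sample rows are hypothetical and chosen purely for illustration; they do not correspond to any real ERP schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
        CREATE TABLE employees (
            emp_id INTEGER PRIMARY KEY,
            name TEXT,
            dept_id INTEGER,
            hire_year INTEGER
        );
        INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Sales');
        INSERT INTO employees VALUES
            (1, 'Alice', 1, 2021),
            (2, 'Bob', 2, 2022),
            (3, 'Chen', 1, 2018);
        """
    )
    return conn


def employees_in_department(conn, dept_name, min_hire_year):
    """Join the two tables and keep rows that satisfy both filter conditions."""
    query = """
        SELECT e.name, d.dept_name, e.hire_year
        FROM employees AS e
        JOIN departments AS d ON e.dept_id = d.dept_id
        WHERE d.dept_name = ? AND e.hire_year >= ?
        ORDER BY e.name
    """
    # Parameters are bound with placeholders rather than string formatting.
    return conn.execute(query, (dept_name, min_hire_year)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    print(employees_in_department(demo, "Engineering", 2020))
```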

Semi-Structured Data. Semi-structured data, or partially structured data, diverges from the conventional tabular format characteristic of relational databases or other tabular data forms [390]. Instead, it incorporates tags and metadata to delineate semantic elements and establish hierarchical relationships among records and fields. This type of data is prevalent in the labor market [370], encompassing employee resumes [461], individual interviews [56], job postings [90], and web pages [106]. Due to the absence of standardized formats and the diversity of semi-structured data types, it necessitates thorough exploration of commonalities, extraction of multidimensional information, and comprehensive filtering to transform it into structured data. For example, Sun et al. collected job postings from an online recruitment website [382]. On this website, each job opening is displayed in HTML, which contains information of salary range, company, location, time, and job description text. They parsed the HTML and obtained structured job posting information.
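The HTML-to-record step described above can be sketched with Python's standard html.parser module. This is not the parser used by Sun et al.; the page layout, the class names (salary, company, location, time, description), and the sample markup are assumptions made for illustration, and the sketch only handles fields that are flat elements without nested tags.

```python
from html.parser import HTMLParser

# Hypothetical CSS class names that mark the fields of interest.
FIELDS = {"salary", "company", "location", "time", "description"}


class JobPostingParser(HTMLParser):
    """Collect the text of elements whose class attribute names a known field."""

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in FIELDS:
            self._current = css_class
            self.record[css_class] = ""

    def handle_endtag(self, tag):
        # Simplification: any closing tag ends the current field.
        self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data


def parse_job_posting(html):
    """Return a dict of field name to whitespace-normalized text."""
    parser = JobPostingParser()
    parser.feed(html)
    parser.close()
    return {key: " ".join(value.split()) for key, value in parser.record.items()}


SAMPLE = """
<div class="posting">
  <span class="company">Example Corp</span>
  <span class="location">Beijing</span>
  <span class="salary">15k-25k</span>
  <span class="time">2020-05-01</span>
  <p class="description">Build  data pipelines
     in Python.</p>
</div>
"""

if __name__ == "__main__":
    print(parse_job_posting(SAMPLE))
```

Real recruitment pages are rarely this regular, so a production parser would need site-specific rules and handling for nested or missing elements.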

Unstructured Data. Unstructured data [396] refers to information that does not conform to the conventional row-column structure found in traditional databases. Within the labor market context, data primarily manifests as text, comprising a blend of structured and unstructured fields. Structured fields denote specific categories like job titles (e.g., occupation), location, etc., while unstructured fields provide a broader description of vacancy content. Approximately 80% of data held by firms today is unstructured [35], expanding at a rate fifteen times faster than structured data. While some approaches skip the processing of natural text, opting instead for direct classification tasks on such text, as observed in the classification of web job vacancies [55], others require extraction and subsequent processing of information from unstructured data into structured data. For instance, extracting skill information from job descriptions [83] involves the identification of skill words through regular expression matching. These identified words are then subjected to expert evaluation for further refinement, resulting in a set of relevant skill words for each job description. Subsequent statistical analysis across all jobs yields a comprehensive frequency distribution of skill words as the structured data, facilitating downstream tasks such as skill demand forecasting.
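The skill extraction pipeline outlined above (regular expression matching followed by frequency statistics across jobs) can be sketched as follows. The skill vocabulary and job descriptions are hypothetical, the expert evaluation step is omitted, and the matching rules are an assumption rather than the procedure of the cited work.

```python
import re
from collections import Counter

# Hypothetical skill vocabulary; in practice this would be curated by experts.
SKILL_VOCABULARY = ["python", "sql", "machine learning", "java"]


def build_skill_pattern(skills):
    """Compile one case-insensitive pattern that matches any whole skill term."""
    # Longer terms first, so multi-word skills are preferred over shorter ones.
    ordered = sorted(skills, key=len, reverse=True)
    alternatives = "|".join(re.escape(skill) for skill in ordered)
    # The lookarounds stop "java" from matching inside "JavaScript".
    return re.compile(
        r"(?<![A-Za-z0-9])(?:" + alternatives + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


def extract_skills(description, pattern):
    """Return the set of distinct skills mentioned in one job description."""
    return {match.group(0).lower() for match in pattern.finditer(description)}


def skill_frequencies(descriptions, skills=SKILL_VOCABULARY):
    """Count, for each skill, the number of descriptions that mention it."""
    pattern = build_skill_pattern(skills)
    counts = Counter()
    for text in descriptions:
        counts.update(extract_skills(text, pattern))
    return counts


if __name__ == "__main__":
    jobs = [
        "We need Python and SQL. Python is a must.",
        "Machine learning with python; JavaScript a plus.",
        "Java developer",
    ]
    print(skill_frequencies(jobs))
```

Counting each skill at most once per description yields a document frequency, which is one common choice for a structured input to tasks such as skill demand forecasting.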

2.3.3 Data Cleaning and Debias

After initial data preprocessing, standard structured data is obtained. However, given the potential for noise introduced during the data acquisition process, coupled with non-uniform acquisition methods, the resulting data may not be of high quality. Therefore, cleaning and debiasing the data becomes imperative. In this part, we first introduce several data quality issues commonly seen in AI-driven talent analytics. Subsequently, we introduce several cleaning and debiasing methods that address these issues.

Data Quality Issues. Common data quality issues include missing data, duplicated data, extraneous data, and inconsistent data [93]. These issues can introduce biases into analyses and lead to inaccurate conclusions. We introduce these issues as follows:

•

Missing Data: Missing data occurs when essential data is absent from a dataset, which can result from factors like data corruption, or failure to capture specific information. Within the talent market, missing values often stem from inconsistent information sources. For instance, job postings that should contain recruitment requirements and descriptions may be missing key fields [184]. Resumes also frequently omit crucial information such as email addresses and physical addresses [360].
•

Duplicated Data: Duplicated data refers to the presence of identical or nearly identical records within a dataset, which can arise from data entry errors, erroneous dataset merging, or technical malfunctions. Duplicated data can distort statistical analyses and exaggerate the significance of certain data points. In the talent market, this issue can arise due to data sources providing overly homogeneous information. Repeated postings of the same position may be erroneously counted as multiple distinct postings in certain analysis and prediction scenarios [382].
•

Extraneous Data: Extraneous data comprises irrelevant or unnecessary information within a dataset, often included mistakenly due to human errors or incorrect data integration processes. Such data can complicate analyses, waste computational resources, and yield inaccurate results. This data often requires further filtering to retain only relevant fields and eliminate irrelevant information, as noted in [34, 44]. Redundancies in job offers can hinder the accuracy of downstream classification tasks [44].
•

Inconsistent Data: Inconsistent data encompasses conflicting or contradictory information within a dataset, stemming from sources like data entry errors, incompatible formats, or changes in data collection methods over time [184, 34, 131]. Such inconsistencies impede meaningful insights and necessitate thorough validation and cleansing to ensure data integrity and accuracy. Common inconsistencies include misspellings and variations in job titles, which require resolution to maintain consistency across professional documents [34, 131]. Besides, erroneous labeling of job postings as job titles is also a common problem [131].

TABLE II: The table of collected talent analytic-related open datasets.

Categories	Dataset	Link	Note
Internal: Recruitment	Kaggle-Entity_Recognition_Resumes	https://www.kaggle.com/datasets/dataturks/resume-entities-for-ner	220 resumes; 10 categories; resume understanding
	LinkedIn-Job-Scraper	https://www.kaggle.com/datasets/arshkon/linkedin-job-postings/data	33,000+ job postings; 27 valuable attributes, including the title, job description, salary, location, etc.
	Job Dataset	https://www.kaggle.com/datasets/ravindrasinghrana/job-description-dataset	Synthetic job postings; 23 attributes, including job title, job salary, job skill, etc.
	Linkedin Jobs & Skills (2024)	https://www.kaggle.com/datasets/asaniczka/1-3m-linkedin-jobs-and-skills-2024	1,296,381 job postings; skills mapping, job recommendation systems
Internal: Employee	HR Analytics	https://www.kaggle.com/datasets/colara/hr-analytics	14,999 employees; 10 attributes, including satisfaction_level, salary, etc.; turnover prediction
	Human Resources Data Set	https://www.kaggle.com/datasets/rhuebner/human-resources-data-set/data	311 records; 36 attributes, including name, salary, etc.

Cleaning and Debias Methods. Data cleaning and debiasing form an iterative process tailored to the requirements and semantics of specific analysis tasks [218]. This process transforms raw data into consistent data that can be analyzed. Herein, we delineate several methodologies frequently employed in talent analytics for data cleaning and debiasing.

•

Data Selection: Data selection involves the meticulous identification and extraction of pertinent data subsets from a broader dataset, guided by specific criteria or requisites [251]. This approach serves to streamline the data analysis process by homing in on the most relevant information, thereby mitigating noise and extraneous data. For instance, Zhang et al. [467] opted to discard company-position pairs with a monthly average talent demand of less than 2, considering scenarios where certain companies may not extensively recruit for particular research positions, resulting in consistently low demand. Shao et al. conducted data cleaning by eliminating job posts and resumes with incomplete attributes [360]. Balaji et al. focused on extracting useful fields from job titles to curate the most pertinent and distinctive set of work activities corresponding to selected company job titles [34].
•

Data Filtering: Data filtering encompasses the systematic elimination or exclusion of undesirable or extraneous data from a dataset, thereby diminishing noise and refining its quality. In typical text cleaning tasks, the compilation of a stop word list proves indispensable for sieving out irrelevant information [156]. Moreover, the adoption of regular expressions to delineate fields and detect anomalies is widely embraced in personalized resume-job matching systems [156] and job search engines [301].
•

Data Clustering: Data clustering involves grouping similar data points together based on certain characteristics or features. Wakchaure et al. [398] leverage the Levenshtein edit distance, a metric gauging string similarity, to designate matches exceeding 0.85 as identical individuals. Gaikwad et al. [131] employ Levenshtein edit distance for duplicate detection within XML documents. Sun et al. [382] utilize text embedding and measure edit distance to identify approximately similar job descriptions. Zhang et al. [467] employ a classification approach to categorize original job titles into 16 distinct categories, thereby mitigating noise and standardizing job titles.
•

Data Synthesis: Data synthesis encompasses the generation of fresh data points to augment extant datasets, proving instrumental in addressing gaps or amplifying the efficacy of analysis and model functionalities [184]. For instance, Magron et al. [267] devised synthetic job postings to refine skill alignment, while Skondras et al. [375] utilized Large Language Models to craft synthetic resume data, thereby bolstering job description classification.
•

Data Normalization: Data normalization is a crucial procedure aimed at standardizing the scale or distribution of data values within a dataset, thereby facilitating precise comparisons and analyses while ensuring uniformity across variables. It finds widespread application in various domains, particularly in time series forecasting. For instance, Liu et al. conducted normalization operations, which involve subtracting the minimum value of each node and dividing it by the difference between the maximum and minimum values of the node [249]. Similarly, Zhang et al. normalized the number of job transitions along the company axis to achieve consistency [466].
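Three of the methods listed above lend themselves to a compact illustration: threshold-based data selection, edit-distance-based duplicate detection, and min-max normalization. The sketch below is a simplified, assumption-laden rendering rather than any cited author's implementation. In particular, converting the Levenshtein distance into a similarity in [0, 1] by dividing by the longer string's length is our assumption about how a 0.85 threshold could be applied, and all function names and sample values are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def drop_near_duplicates(texts, threshold=0.85):
    """Keep a text only if no previously kept text exceeds the threshold."""
    kept = []
    for text in texts:
        if all(similarity(text, other) <= threshold for other in kept):
            kept.append(text)
    return kept


def select_pairs(monthly_demand, min_average=2.0):
    """Discard (company, position) pairs whose average monthly demand is below min_average."""
    return {pair: series for pair, series in monthly_demand.items()
            if series and sum(series) / len(series) >= min_average}


def min_max_normalize(series):
    """Subtract the minimum and divide by (max - min); constant series map to 0."""
    low, high = min(series), max(series)
    if high == low:
        return [0.0 for _ in series]
    return [(value - low) / (high - low) for value in series]


titles = ["Senior Data Scientist", "Senior Data Scientst", "Accountant"]
unique_titles = drop_near_duplicates(titles)

demand = {("A", "researcher"): [1, 0, 2], ("B", "engineer"): [3, 4, 2]}
selected = select_pairs(demand)
normalized = {pair: min_max_normalize(series) for pair, series in selected.items()}
```

The pairwise comparison in `drop_near_duplicates` is quadratic in the number of records, so a large corpus would typically call for blocking or embedding-based candidate retrieval before the edit-distance check.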

2.3.4 Limitations

Existing data and processing methods in the field of talent analytics continue to face many limitations and challenges, which constrain the development of AI-based approaches. We summarize these issues as follows.

•

Lack of benchmark datasets: Most current research in talent analytics heavily depends on proprietary data related to job applicants, employees, and organizations. The sensitive nature of this data often prevents it from being publicly accessible, severely restricting the development of publicly available benchmark datasets. This limitation impedes the standardization of problem definitions and comparative analyses of methodologies, thus slowing down technological advancements in the field. While some job-related data has been made open-source, it remains limited in scale and temporal coverage. Consequently, there is an urgent need to create comprehensive open-source benchmark datasets for pivotal tasks within this field. We present several open datasets, as shown in Table II, primarily related to resume parsing and employee turnover prediction. These datasets can help researchers in developing standardized datasets for uniform comparisons. Additionally, the data anonymization and de-identification methods require more extensive consideration to facilitate the construction of more open-source datasets for various scenarios.
•

Diversity deficiencies in datasets: The datasets commonly used in AI-based talent analytics often lack adequate consideration of diverse populations, especially underrepresented minorities. This deficiency hampers efforts to evaluate and ensure algorithmic fairness. Models developed from such datasets may exhibit biases, leading to inequitable outcomes among various demographic groups. Therefore, it is essential to integrate more diverse datasets to promote fairness and enhance the overall performance of AI algorithms in talent analytics.
•

Challenges with subjective data: In many talent analytics scenarios, such as employee promotion decisions, outcomes are frequently based on the subjective judgment of supervisors. This practice can introduce inherent biases and noise into the real datasets collected from these decisions. Traditional data preprocessing techniques often fail to detect and address these anomalies effectively. Therefore, there is a pressing need for specialized noise detection algorithms or the development of robust AI-based talent analytics models that can handle such irregularities. These approaches are essential to ensure the accuracy and fairness of the analytics outcomes.

3 Talent Management

Talent management, which focuses on placing the right person in the right job at the right time, has emerged as a predominant human capital topic in the early twenty-first century [66, 63]. Specifically, the management process spans the entire journey of talent, from entering the organization through subsequent development. Accordingly, we first describe various intelligent talent recruitment scenarios, such as job posting generation [254, 332] and talent searching [162, 274, 160]. Once talent has entered the organization, timely and accurate feedback is critical to ensuring its sustainable development. We therefore next discuss two primary issues in talent assessment: interview question recommendation [363, 336, 365] and assessment scoring [304, 171, 77, 234]. Finally, across the whole development process, career development is important for both managing human capital resources and individual growth. Accordingly, we outline several post-employment career development problems, including course recommendations [377, 402, 401] for employee training and employee dynamics analysis [477, 259, 386]. In the following sections, we will delve into these issues in detail.

TABLE III: The table of collected papers related to talent management.

Task	Method	Data	Reference
Talent Recruitment
Job Posting Generation	RNN	Job posting	[254, 332]
Job Posting Generation	LLMs	Job posting	[260, 52]
Resume Understanding	Rule-based method	Resume	[216]
Resume Understanding	HMM	Resume	[449]
Resume Understanding	SVM	Resume	[449, 89]
Resume Understanding	CRF	Resume	[321, 74]
Resume Understanding	LSTM, CNN	Resume	[27]
Resume Understanding	LSTM, CRF	Resume	[327, 240, 262]
Resume Understanding	RoBERTa, GCN	Resume	[411]
Resume Understanding	Multimodal pre-trained model	Resume	[443, 196]
Talent Searching	Keywords matching	Query, Resume	[274]
Talent Searching	Keywords matching, knowledge graph	Query, Resume	[407]
Talent Searching	Traditional classifiers	Query, Resume	[19, 311]
Talent Searching	Topic model, multi-armed bandit algorithm	Query, Resume	[139]
Talent Searching	Learning-to-rank algorithm	Query, Resume	[161, 162]
Talent Searching	DNN, learning-to-rank algorithm	Query, Resume	[340, 440]
Person-Job Fitting	Latent variable model	Job posting, resume	[273]
Person-Job Fitting	CNN	Job posting, resume	[485, 475, 169, 481]
Person-Job Fitting	RNN	Job posting, resume	[334, 335, 435]
Person-Job Fitting	CNN, RNN	Job posting, resume	[46, 263, 197]
Person-Job Fitting	BERT	Job posting, resume	[226, 2, 250, 360], [208, 123, 152, 147]
Person-Job Fitting	GNN, BERT	Job posting, resume	[45, 444, 409, 437, 73]
Person-Job Fitting	Attention mechanism	Job posting, resume	[168, 129, 473, 178]
Person-Job Fitting	Reinforcement learning	Job posting, resume	[128]
Person-Job Fitting	Federated learning	Job posting, resume	[472]
Person-Job Fitting	LLMs	Job posting, resume	[479, 141, 419, 115, 450]
Person-Job Fitting	Ranking	Job posting, resume, social media	[122, 121]
Person-Job Fitting	K-means	Social media	[145]
Person-Job Fitting	Traditional classifiers	Social media	[290]
Person-Job Fitting	Gamma-Poisson model	User behavior	[53]
Talent Assessment
Interview Question Recommendation	Topic model	Job posting, resume, assessment report	[364, 363]
Interview Question Recommendation	Knowledge graph	Job posting, resume, search engine log	[336]
Interview Question Recommendation	Knowledge graph, integer linear programming	Question bank	[102]
Interview Question Recommendation	BERT	Job posting	[365]
Interview Question Recommendation	GCN	KSC, search engine log	[333]
Assessment Scoring	Regression models	Interview videos	[307, 303, 304]
Assessment Scoring	Doc2Vec	Interview videos	[78, 80]
Assessment Scoring	GNN	Interview records	[77]
Assessment Scoring	Attention mechanism	Interview videos	[172, 171, 174]
Assessment Scoring	Transformer	Interview videos	[372, 395]
Assessment Scoring	Adversarial learning	Interview videos	[173]
Assessment Scoring	Traditional classifiers	Employee profiles	[252, 177, 234]
Career Development
Course Recommendation	Collaborative filtering	Trainees’ profiles	[402, 401]
Course Recommendation	Markov decision process	Learning records	[377]
Course Recommendation	Neural network	Learning records	[320]
Course Recommendation	Reinforcement learning	Learning records	[480]
Course Recommendation	KG-based Transformer	Learning records, knowledge graph	[439]
Promotion Prediction	Traditional classifiers	Social network	[453]
Promotion Prediction	Traditional classifiers	Personal profile, job posting log	[259]
Promotion Prediction	Multiple classification	Employee’s Detail Record	[253]
Promotion Prediction	Survival analysis	Personal profile, career paths	[233]
Turnover Prediction	Traditional classifiers	HR dataset	[373, 302, 10]
Turnover Prediction	GNN, RNN	Profile, turnover records	[387]
Turnover Prediction	Neural network	Profile, turnover records	[386]
Turnover Prediction	Traditional classifiers	HR Information Systems	[6]
Turnover Prediction	GNN, RNN, survival analysis	Job description, organizational tree, profile, turnover records	[165]
Job Satisfaction	Traditional classifiers	Personal profile, job profile	[22]
Job Satisfaction	Traditional classifiers	Social media	[346]
Career Mobility Prediction	RNN	Career paths	[236, 170]

Career Mobility Prediction

Attention mechanism