License: CC BY 4.0
arXiv:2403.04160v1 [cs.IR] 07 Mar 2024

Improving Retrieval in Theme-specific Applications
using a Corpus Topical Taxonomy

SeongKu Kang, University of Illinois at Urbana-Champaign, IL, USA
Shivam Agarwal, University of Illinois at Urbana-Champaign, IL, USA
Bowen **, University of Illinois at Urbana-Champaign, IL, USA
Dongha Lee, Yonsei University, Seoul, Republic of Korea
Hwanjo Yu, Pohang University of Science and Technology, Pohang, Republic of Korea
Jiawei Han, University of Illinois at Urbana-Champaign, IL, USA
(2024)
Abstract.

Document retrieval has greatly benefited from advances in large-scale pre-trained language models (PLMs). However, their effectiveness is often limited in theme-specific applications for specialized areas or industries, due to unique terminologies, incomplete contexts of user queries, and specialized search intents. To capture the theme-specific information and improve retrieval, we propose to use a corpus topical taxonomy, which outlines the latent topic structure of the corpus while reflecting user-interested aspects. We introduce the ToTER (Topical Taxonomy Enhanced Retrieval) framework, which identifies the central topics of queries and documents with the guidance of the taxonomy, and exploits their topical relatedness to supplement missing contexts. As a plug-and-play framework, ToTER can be flexibly employed to enhance various PLM-based retrievers. Through extensive quantitative, ablative, and exploratory experiments on two real-world datasets, we ascertain the benefits of using a topical taxonomy for retrieval in theme-specific applications and demonstrate the effectiveness of ToTER.

Document retrieval; Topical taxonomy; Theme-specific application
Journal year: 2024
Copyright: rights retained
Conference: Proceedings of the ACM Web Conference 2024; May 13–17, 2024; Singapore, Singapore
Book title: Proceedings of the ACM Web Conference 2024 (WWW ’24), May 13–17, 2024, Singapore, Singapore
DOI: 10.1145/3589334.3645512
ISBN: 979-8-4007-0171-9/24/05
CCS: Information systems, Information retrieval
CCS: Information systems, Specialized information retrieval
CCS: Information systems, Document topic models

1. Introduction

Pre-trained language models (PLMs) have greatly improved document retrieval (Izacard et al., 2021; Xiong et al., 2021; Khattab and Zaharia, 2020; Karpukhin et al., 2020). PLM-based retrieval models are first pre-trained on massive text corpora to acquire general language understanding. They are then fine-tuned on large datasets of annotated query-document pairs, which enables them to capture the semantic similarity between queries and documents. While successful in general domains such as web search, which serve a broad user base, these models are often limited in specialized applications with specific themes.

Theme-specific applications are specialized areas or industries where retrieval tasks focus on a specific theme (e.g., academic paper search, product search in e-commerce). Retrieval in theme-specific applications poses three challenges: specialized terminology and niche content (C1), limited context in user queries (C2), and specialized user interests and search intents (C3).

C1: Theme-specific domains often have specialized terminologies, which are not frequently included in the general text corpus. For example, Table 1(a) shows that an academic paper includes many technical terms specific to certain research fields, such as “proof of retrievability” and “cryptographic proof”. PLM-based retrievers trained on general text corpora often lack an inherent understanding of domain-specific specialized and niche terminologies (Dong et al., 2022).

C2: Users familiar with the domain often omit contexts they believe are naturally implied in their query. For example, in product search, users enter a query such as “RTX 3090” without adding contexts such as “graphics cards”. Table 1(a) shows that queries from domain experts may skip over general contexts such as “cryptography” or “computer security”. Omitted terms hinder the model’s ability to fully comprehend the query, leading to imprecise retrieval outcomes. Inferring missing contexts is more challenging in theme-specific applications, as it often requires domain-specific knowledge.

Table 1. Examples of retrieval in theme-specific applications. We use Contriever-MS (retriever) and MiniLM-L-12 (reranker). Contents closely related to the query are denoted in bold. Details of topic class and core phrase discovery are provided in Section 4.

(a) Academic domain

Query: Provable data possession at untrusted stores

Document A (label: relevant, rank: top-173): Pors: proofs of retrievability for large files. In this paper, we define and explore proofs of retrievability (PORs). … A POR may be viewed as a kind of cryptographic proof of knowledge (POK). … We view PORs as an important tool for semi-trusted online archives. Existing cryptographic techniques help users ensure the privacy and integrity of files they retrieve. …

ToTER rank: top-10

Topic classes: cryptography, trusted computing, digital content, computer network, computer security, computer science

Core phrases: encryption, access control, security, key, server

(b) Product domain

Query: #1 black natural hair dye without ammonia or peroxide

Document A (label: relevant, rank: top-70): ONC NATURALCOLORS (1N Black) 4 fl. oz. (120 mL). Healthier permanent hair dye with certified organic ingredients, ammonia free, vegan friendly, 100 gray coverage. … Cruelty-free and vegan. It is time to make the clean choice.

Document B (label: irrelevant, rank: top-11): Roux Fanci-full Rinse 16 Hidden Honey. Tones and enhances gray and blonde hair. Rinses in and shampoos out. No ammonia or peroxide. … 15 applications per bottle, temporary hair color, 15 ounce bottle.

ToTER rank: top-5 (Doc. A), top-32 (Doc. B)

Topic classes: hair color, hair coloring products, hair care, beauty & personal care

Core phrases: dye, permanent, lasting, permanent hair color, ammonia free

C3: Users in theme-specific applications have more specialized interests and intents compared to general web searches. For example, researchers may want to find papers within a specific field of study to discern a particular research trajectory. In product search, users often filter results based on specific product attributes. For example, Table 1(b) shows that both documents are somewhat relevant to the query as both of them are about ammonia-free hair color products. However, the query targets hair dye with lasting effects, instead of hair rinse with temporary effects. These specialized search intents are not effectively captured by models trained on general corpora.

Accumulating ample labeled data can mitigate these challenges to some extent. However, the creation of such datasets in theme-specific applications is particularly challenging due to the need for domain expertise (e.g., academic domain) and the proprietary nature of user logs in specialized applications (e.g., e-commerce) (Chaudhary et al., 2023; Li et al., 2023). As a result, PLM-based retrieval models often struggle to accurately capture relevance in theme-specific applications (Thakur et al., 2021).

To improve retrieval without relying on labeled data, we propose to use a corpus topical taxonomy (Huang et al., 2020; Lee et al., 2022a; Meng et al., 2020; Shang et al., 2020; Zhang et al., 2018), which has been extensively studied for organizing topics in a corpus. A corpus topical taxonomy outlines the latent topic hierarchy within the corpus as a tree structure, where each node is a topic class represented by a cluster of semantically coherent terms describing the topic, as shown in Figure 1. Recent taxonomy construction studies (Huang et al., 2020; Lee et al., 2022a; Arous et al., 2023) have effectively reflected user-interested aspects, drawing from a foundational seed taxonomy rooted in human knowledge of the application (e.g., fields of study from Microsoft Academic (Shen et al., 2018)). The constructed taxonomy can be subsequently employed to provide additional clues to link queries and documents by discerning their topical relatedness and supplementing the missing contexts.
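To make the data structure concrete, the following is a minimal Python sketch of a topical taxonomy as a tree whose nodes each hold a topic class name and a cluster of terms. The class name TopicNode, its methods, and the toy topics and terms (borrowed loosely from Table 1(a)) are assumptions for illustration only; they are not the representation used by any particular taxonomy construction method.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class TopicNode:
    """One topic class: a name plus a cluster of terms describing the topic."""
    name: str
    terms: List[str] = field(default_factory=list)
    # repr=False avoids infinite recursion through parent <-> children links.
    parent: Optional["TopicNode"] = field(default=None, repr=False)
    children: List["TopicNode"] = field(default_factory=list)

    def add_child(self, name: str, terms: List[str]) -> "TopicNode":
        """Attach a more specific topic class under this node."""
        child = TopicNode(name=name, terms=terms, parent=self)
        self.children.append(child)
        return child

    def path_to_root(self) -> List[str]:
        """Topic names from this node up to the root (specific to general)."""
        node: Optional[TopicNode] = self
        path: List[str] = []
        while node is not None:
            path.append(node.name)
            node = node.parent
        return path


# Hypothetical toy taxonomy; topic names and terms are illustrative only.
root = TopicNode("computer science", ["algorithm", "system"])
security = root.add_child("computer security", ["security", "access control"])
crypto = security.add_child("cryptography", ["encryption", "key"])

print(crypto.path_to_root())
```

Walking from a specific topic class toward the root recovers the broader contexts (here, "computer security" and "computer science") that a domain expert's query might leave unstated.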

We propose the Topical Taxonomy Enhanced Retrieval (ToTER) framework, which systematically leverages the corpus taxonomy to complement the semantic matching of PLM-based retrieval. The taxonomy provides a high-level topic hierarchy of the entire corpus. To harness this corpus-level knowledge for retrieval, we first link it to individual documents. Specifically, ToTER first conducts topic class relevance learning to discern the relevance of each document to each topic class node in the taxonomy. We formulate this step as an unsupervised multi-label classification problem without document-topic labels. ToTER introduces a new silver label generation strategy along with a new collective distillation process to produce rich and reliable signals. This class relevance learning allows ToTER to effectively identify the central subjects of a given text under the guidance of the topical taxonomy reflecting user interests.
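The output of a multi-label step like this can be pictured as a vector of relevance scores over topic classes for each text. The sketch below is an illustration under stated assumptions, not ToTER's actual procedure: the scores are made-up numbers, the 0.5 threshold is arbitrary, and cosine similarity is swapped in as one simple way to compare two score vectors. The function names central_topics and topical_relatedness are hypothetical.

```python
import math
from typing import Dict, List


def central_topics(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Multi-label decision: keep every class whose score clears the threshold,
    ordered from most to least relevant."""
    kept = [c for c, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda c: scores[c], reverse=True)


def topical_relatedness(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two class-relevance vectors.
    Classes missing from a vector are treated as having score 0."""
    classes = set(a) | set(b)
    dot = sum(a.get(c, 0.0) * b.get(c, 0.0) for c in classes)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up relevance scores, for illustration only.
query_scores = {"cryptography": 0.9, "computer security": 0.8, "hair care": 0.05}
paper_scores = {"cryptography": 0.85, "computer security": 0.7, "computer network": 0.3}
product_scores = {"hair care": 0.9, "hair color": 0.8}

print(central_topics(query_scores))
print(topical_relatedness(query_scores, paper_scores))
print(topical_relatedness(query_scores, product_scores))
```

In this toy setting, the query and the paper share high-scoring topic classes and therefore receive a much higher relatedness value than the query and the product description, even if they share few surface terms.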

Based on the identified topic class relevance, ToTER leverages the topical relatedness of a query and documents to complement the semantic matching by PLM-based retrievers. In Table 1(a), we see that ToTER can improve retrieval by identifying common topic classes like “cryptography” and “computer security” for both query and document, given the presence of terms frequently used for these topic classes (C1). Furthermore, ToTER combines the topical relatedness with more fine-grained phrase knowledge for each topic class, helping to distinguish documents having similar topics. In Table 1(b), ToTER identifies and utilizes core topical phrases such as “dye”, “lasting”, and “permanent hair color” to enrich the query, enabling relevant documents to be found more accurately (C2). This entire process is built upon the topical taxonomy reflecting user-interested aspects (C3). Formally, ToTER introduces three strategies to complement the PLM-based retrieval: (1) search space adjustment, (2) class relevance matching, and (3) query enrichment by core phrases. Our contributions are summarized as follows: