Transformer-based Models for Long-Form Document Matching: Challenges and Empirical Analysis

Authors: Akshita Jha, Adithya Samavedhi, Vineeth Rakesh, Jaideep Chandrashekar, Chandan K. Reddy

Abstract: Recent advances in the area of long document matching have primarily focused on using transformer-based models for long document encoding and matching. There are two primary challenges associated with these models. Firstly, the performance gain provided by transformer-based models comes at a steep cost - both in terms of the required training time and the resource (memory and energy) consumption. The second major limitation is their inability to handle more than a pre-defined input token length at a time. In this work, we empirically demonstrate the effectiveness of simple neural models (such as feed-forward networks, and CNNs) and simple embeddings (like GloVe, and Paragraph Vector) over transformer-based models on the task of document matching. We show that simple models outperform the more complex BERT-based models while taking significantly less training time, energy, and memory. The simple models are also more robust to variations in document length and text perturbations.

Submitted 7 February, 2023; originally announced February 2023.

arXiv:2108.09190

Supervised Contrastive Learning for Interpretable Long-Form Document Matching

Authors: Akshita Jha, Vineeth Rakesh, Jaideep Chandrashekar, Adithya Samavedhi, Chandan K. Reddy

Abstract: Recent advancements in deep learning techniques have transformed the area of semantic text matching. However, most state-of-the-art models are designed to operate with short documents such as tweets, user reviews, comments, etc. These models have fundamental limitations when applied to long-form documents such as scientific papers, legal documents, and patents. When handling such long documents, there are three primary challenges: (i) the presence of different contexts for the same word throughout the document, (ii) small sections of contextually similar text between two documents, but dissimilar text in the remaining parts (this defies the basic understanding of "similarity"), and (iii) the coarse nature of a single global similarity measure which fails to capture the heterogeneity of the document content. In this paper, we describe CoLDE: Contrastive Long Document Encoder - a transformer-based framework that addresses these challenges and allows for interpretable comparisons of long documents. CoLDE uses unique positional embeddings and a multi-headed chunkwise attention layer in conjunction with a supervised contrastive learning framework to capture similarity at three different levels: (i) high-level similarity scores between a pair of documents, (ii) similarity scores between different sections within and across documents, and (iii) similarity scores between different chunks in the same document and across other documents. These fine-grained similarity scores aid in better interpretability.
We evaluate CoLDE on three long document datasets namely, ACL Anthology publications, Wikipedia articles, and USPTO patents. Besides outperforming the state-of-the-art methods on the document matching task, CoLDE is also robust to changes in document length and text perturbations and provides interpretable results.

Submitted 2 June, 2022; v1 submitted 20 August, 2021; originally announced August 2021.

arXiv:1709.00348

Inferring Networked Device Categories from Low-Level Activity Indicators

Authors: Kyumars Sheykh Esmaili, Jaideep Chandrashekar, Pascal Le Guyadec

Abstract: We study the problem of inferring the type of a networked device in a home network by leveraging low level traffic activity indicators seen at commodity home gateways. We analyze a dataset of detailed device network activity obtained from 240 subscriber homes of a large European ISP and extract a number of traffic and spatial fingerprints for individual devices. We develop a two level taxonomy to describe devices onto which we map individual devices using a number of heuristics. We leverage the heuristically derived labels to train classifiers that distinguish device classes based on the traffic and spatial fingerprints of a device. Our results show an accuracy level up to 91% for the coarse level category and up to 84% for the fine grained category. By incorporating information from other sources (e.g., MAC OUI), we are able to further improve accuracy to above 97% and 92%, respectively. Finally, we also extract a set of simple and human-readable rules that concisely capture the behaviour of these distinct device categories.

Submitted 1 September, 2017; originally announced September 2017.

Comments: 14 pages, 9 figures, 7 tables

arXiv:1504.06093

Taming the Android AppStore: Lightweight Characterization of Android Applications

Authors: Luigi Vigneri, Jaideep Chandrashekar, Ioannis Pefkianakis, Olivier Heen

Abstract: There are over 1.2 million applications on the Google Play store today with a large number of competing applications for any given use or function. This creates challenges for users in selecting the right application. Moreover, some of the applications being of dubious origin, there are no mechanisms for users to understand who the applications are talking to, and to what extent. In our work, we first develop a lightweight characterization methodology that can automatically extract descriptions of application network behavior, and apply this to a large selection of applications from the Google App Store. We find several instances of overly aggressive communication with tracking websites, of excessive communication with ad related sites, and of communication with sites previously associated with malware activity. Our results underscore the need for a tool to provide users more visibility into the communication of apps installed on their mobile devices. To this end, we develop an Android application to do just this; our application monitors outgoing traffic, associates it with particular applications, and then identifies destinations in particular categories that we believe suspicious or else important to reveal to the end-user.

Submitted 27 April, 2015; v1 submitted 23 April, 2015; originally announced April 2015.

Comments: 20 pages, single column

Report number: RR-15-305

arXiv:1212.2744

Mixture Models of Endhost Network Traffic

Authors: John Mark Agosta, Jaideep Chandrashekar, Mark Crovella, Nina Taft, Daniel Ting

Abstract: In this work we focus on modeling a little studied type of traffic, namely the network traffic generated from endhosts. We introduce a parsimonious parametric model of the marginal distribution for connection arrivals. We employ mixture models based on a convex combination of component distributions with both heavy and light-tails. These models can be fitted with high accuracy using maximum likelihood techniques. Our methodology assumes that the underlying user data can be fitted to one of many modeling options, and we apply Bayesian model selection criteria as a rigorous way to choose the preferred combination of components. Our experiments show that a simple Pareto-exponential mixture model is preferred for a wide range of users, over both simpler and more complex alternatives. This model has the desirable property of modeling the entire distribution, effectively segmenting the traffic into the heavy-tailed as well as the non-heavy-tailed components. We illustrate that this technique has the flexibility to capture the wide diversity of user behaviors.

Submitted 12 December, 2012; originally announced December 2012.

Showing 1–5 of 5 results for author: Chandrashekar, J