-
On The Persona-based Summarization of Domain-Specific Documents
Authors:
Ankan Mullick,
Sombit Bose,
Rounak Saha,
Ayan Kumar Bhowmick,
Pawan Goyal,
Niloy Ganguly,
Prasenjit Dey,
Ravi Kokku
Abstract:
In an ever-expanding world of domain-specific knowledge, the increasing complexity of consuming, and storing information necessitates the generation of summaries from large information repositories. However, every persona of a domain has different requirements of information and hence their summarization. For example, in the healthcare domain, a persona-based (such as Doctor, Nurse, Patient etc.) approach is imperative to deliver targeted medical information efficiently. Persona-based summarization of domain-specific information by humans is a high cognitive load task and is generally not preferred. The summaries generated by two different humans have high variability and do not scale in cost and subject matter expertise as domains and personas grow. Further, AI-generated summaries using generic Large Language Models (LLMs) may not necessarily offer satisfactory accuracy for different domains unless they have been specifically trained on domain-specific data and can also be very expensive to use in day-to-day operations. Our contribution in this paper is two-fold: 1) We present an approach to efficiently fine-tune a domain-specific small foundation LLM using a healthcare corpus and also show that we can effectively evaluate the summarization quality using AI-based critiquing. 2) We further show that AI-based critiquing has good concordance with Human-based critiquing of the summaries. Hence, such AI-based pipelines to generate domain-specific persona-based summaries can be easily scaled to other domains such as legal, enterprise documents, education etc. in a very efficient and cost-effective manner.
Submitted 6 June, 2024;
originally announced June 2024.
-
QBER: Quantifying Cyber Risks for Strategic Decisions
Authors:
Muriel Figueredo Franco,
Aiatur Rahaman Mullick,
Santosh Jha
Abstract:
Quantifying cyber risks is essential for organizations to grasp their vulnerability to threats and make informed decisions. However, current approaches still need to work on blending economic viewpoints to provide insightful analysis. To bridge this gap, we introduce the QBER approach to offer decision-makers measurable risk metrics. QBER evaluates losses from cyberattacks, performs detailed risk analyses based on existing cybersecurity measures, and provides thorough cost assessments. Our contributions involve outlining cyberattack probabilities and risks, identifying Technical, Economic, and Legal (TEL) impacts, creating a model to gauge impacts, suggesting risk mitigation strategies, and examining trends and challenges in implementing widespread Cyber Risk Quantification (CRQ). The QBER approach serves as a guide for organizations to assess risks and strategically invest in cybersecurity.
Submitted 6 May, 2024;
originally announced May 2024.
-
Intent Detection and Entity Extraction from BioMedical Literature
Authors:
Ankan Mullick,
Mukur Gupta,
Pawan Goyal
Abstract:
Biomedical queries have become increasingly prevalent in web searches, reflecting the growing interest in accessing biomedical literature. Despite recent research on large-language models (LLMs) motivated by endeavours to attain generalized intelligence, their efficacy in replacing task and domain-specific natural language understanding approaches remains questionable. In this paper, we address this question by conducting a comprehensive empirical evaluation of intent detection and named entity recognition (NER) tasks from biomedical text. We show that Supervised Fine Tuned approaches are still relevant and more effective than general-purpose LLMs. Biomedical transformer models such as PubMedBERT can surpass ChatGPT on NER task with only 5 supervised examples.
Submitted 4 April, 2024;
originally announced April 2024.
-
Long Dialog Summarization: An Analysis
Authors:
Ankan Mullick,
Ayan Kumar Bhowmick,
Raghav R,
Ravi Kokku,
Prasenjit Dey,
Pawan Goyal,
Niloy Ganguly
Abstract:
Dialog summarization has become increasingly important in managing and comprehending large-scale conversations across various domains. This task presents unique challenges in capturing the key points, context, and nuances of multi-turn long conversations for summarization. It is worth noting that summarization techniques may vary based on specific requirements: in a shopping-chatbot scenario, the dialog summary helps to learn user preferences, whereas in the case of a customer call center, the summary may involve the problem attributes that a user specified and the final resolution provided. This work emphasizes the significance of creating coherent and contextually rich summaries for effective communication in various applications. We explore current state-of-the-art approaches for long dialog summarization in different domains, and evaluations based on benchmark metrics show that no single model performs well across various areas for distinct summarization tasks.
Submitted 26 February, 2024;
originally announced February 2024.
-
MatSciRE: Leveraging Pointer Networks to Automate Entity and Relation Extraction for Material Science Knowledge-base Construction
Authors:
Ankan Mullick,
Akash Ghosh,
G Sai Chaitanya,
Samir Ghui,
Tapas Nayak,
Seung-Cheol Lee,
Satadeep Bhattacharjee,
Pawan Goyal
Abstract:
Material science literature is a rich source of factual information about various categories of entities (like materials and compositions) and various relations between these entities, such as conductivity, voltage, etc. Automatically extracting this information to generate a material science knowledge base is a challenging task. In this paper, we propose MatSciRE (Material Science Relation Extractor), a Pointer Network-based encoder-decoder framework, to jointly extract entities and relations from material science articles as a triplet (entity1, relation, entity2). Specifically, we target the battery materials and identify five relations to work on - conductivity, coulombic efficiency, capacity, voltage, and energy. Our proposed approach achieved a much better F1-score (0.771) than a previous attempt using ChemDataExtractor (0.716). The overall graphical framework of MatSciRE is shown in Fig 1. The material information is extracted from material science literature in the form of entity-relation triplets using MatSciRE.
Submitted 18 January, 2024;
originally announced January 2024.
-
Novel Intent Detection and Active Learning Based Classification (Student Abstract)
Authors:
Ankan Mullick
Abstract:
Novel intent class detection is an important problem in real-world scenarios for conversational agents for continuous interaction. Several research works have addressed detecting novel intents in monolingual (primarily English) texts and images. However, current systems lack an end-to-end universal framework to detect novel intents across different languages with less human annotation effort for misclassified and system-rejected samples. This paper proposes NIDAL (Novel Intent Detection and Active Learning based classification), a semi-supervised framework to detect novel intents while reducing human annotation cost. Empirical results on various benchmark datasets demonstrate that this system outperforms the baseline methods by a margin of more than 10% in accuracy and macro-F1. The system achieves this while keeping the overall annotation cost to just ~6-10% of the unlabeled data available to the system.
Submitted 22 February, 2023;
originally announced April 2023.
-
Intent Identification and Entity Extraction for Healthcare Queries in Indic Languages
Authors:
Ankan Mullick,
Ishani Mondal,
Sourjyadip Ray,
R Raghav,
G Sai Chaitanya,
Pawan Goyal
Abstract:
Scarcity of data and technological limitations for resource-poor languages in developing countries like India pose a threat to the development of sophisticated NLU systems for healthcare. To assess the current status of various state-of-the-art language models in healthcare, this paper studies the problem by initially proposing two different Healthcare datasets, Indian Healthcare Query Intent-WebMD and 1mg (IHQID-WebMD and IHQID-1mg) and one real world Indian hospital query data in English and multiple Indic languages (Hindi, Bengali, Tamil, Telugu, Marathi and Gujarati) which are annotated with the query intents as well as entities. Our aim is to detect query intents and extract corresponding entities. We perform extensive experiments on a set of models in various realistic settings and explore two scenarios based on the access to English data only (less costly) and access to target language data (more expensive). We analyze context specific practical relevancy through empirical analysis. The results, expressed in terms of overall F1 score, show that our approach is practically useful to identify intents and entities.
Submitted 19 February, 2023;
originally announced February 2023.
-
Improving Self-supervised Learning for Out-of-distribution Task via Auxiliary Classifier
Authors:
Harshita Boonlia,
Tanmoy Dam,
Md Meftahul Ferdaus,
Sreenatha G. Anavatti,
Ankan Mullick
Abstract:
In real world scenarios, out-of-distribution (OOD) datasets may have a large distributional shift from training datasets. This phenomenon generally occurs when a trained classifier is deployed on varying dynamic environments, which causes a significant drop in performance. To tackle this issue, we propose an end-to-end deep multi-task network in this work. Observing a strong relationship between rotation prediction (self-supervised) accuracy and semantic classification accuracy on OOD tasks, we introduce an additional auxiliary classification head in our multi-task network along with the semantic classification and rotation prediction heads. To observe the influence of this additional classifier in improving the rotation prediction head, our proposed learning method is framed as a bi-level optimisation problem where the upper level is trained to update the parameters for the semantic classification and rotation prediction heads. In the lower-level optimisation, only the auxiliary classification head is updated through the semantic classification head by fixing the parameters of the semantic classification head. The proposed method has been validated on three unseen OOD datasets, where it exhibits a clear improvement in semantic classification accuracy over two other baseline methods. Our code is available on GitHub: https://github.com/harshita-555/OSSL
Submitted 6 September, 2022;
originally announced September 2022.
-
An Evaluation Framework for Legal Document Summarization
Authors:
Ankan Mullick,
Abhilash Nandy,
Manav Nitin Kapadnis,
Sohan Patnaik,
R Raghav,
Roshni Kar
Abstract:
A law practitioner has to go through numerous lengthy legal case proceedings for their practices of various categories, such as land dispute, corruption, etc. Hence, it is important to summarize these documents, and ensure that summaries contain phrases with intent matching the category of the case. To the best of our knowledge, there is no evaluation metric that evaluates a summary based on its intent. We propose an automated intent-based summarization metric, which shows a better agreement with human evaluation as compared to other automated metrics like BLEU, ROUGE-L etc. in terms of human satisfaction. We also curate a dataset by annotating intent phrases in legal documents, and show a proof of concept as to how this system can be automated. Additionally, all the code and data to generate reproducible results is available on Github.
Submitted 17 May, 2022;
originally announced May 2022.
-
Fine-grained Intent Classification in the Legal Domain
Authors:
Ankan Mullick,
Abhilash Nandy,
Manav Nitin Kapadnis,
Sohan Patnaik,
R Raghav
Abstract:
A law practitioner has to go through a lot of long legal case proceedings. To understand the motivation behind the actions of different parties/individuals in a legal case, it is essential that the parts of the document that express an intent corresponding to the case be clearly understood. In this paper, we introduce a dataset of 93 legal documents, belonging to the case categories of either Murder, Land Dispute, Robbery, or Corruption, where phrases expressing intent same as the category of the document are annotated. Also, we annotate fine-grained intents for each such phrase to enable a deeper understanding of the case for a reader. Finally, we analyze the performance of several transformer-based models in automating the process of extracting intent phrases (both at a coarse and a fine-grained level), and classifying a document into one of the possible 4 categories, and observe that, our dataset is challenging, especially in the case of fine-grained intent classification.
Submitted 6 May, 2022;
originally announced May 2022.
-
A Framework to Generate High-Quality Datapoints for Multiple Novel Intent Detection
Authors:
Ankan Mullick,
Sukannya Purkayastha,
Pawan Goyal,
Niloy Ganguly
Abstract:
Systems like Voice-command based conversational agents are characterized by a pre-defined set of skills or intents to perform user specified tasks. In the course of time, newer intents may emerge requiring retraining. However, the newer intents may not be explicitly announced and need to be inferred dynamically. Thus, there are two important tasks at hand (a). identifying emerging new intents, (b). annotating data of the new intents so that the underlying classifier can be retrained efficiently. The tasks become specially challenging when a large number of new intents emerge simultaneously and there is a limited budget of manual annotation. In this paper, we propose MNID (Multiple Novel Intent Detection) which is a cluster based framework to detect multiple novel intents with budgeted human annotation cost. Empirical results on various benchmark datasets (of different sizes) demonstrate that MNID, by intelligently using the budget for annotation, outperforms the baseline methods in terms of accuracy and F1-score.
Submitted 4 May, 2022;
originally announced May 2022.
-
RTE: A Tool for Annotating Relation Triplets from Text
Authors:
Ankan Mullick,
Animesh Bera,
Tapas Nayak
Abstract:
In this work, we present a Web-based annotation tool, 'Relation Triplets Extractor' (RTE, https://abera87.github.io/annotate/), for annotating relation triplets from text. Relation extraction is an important task for extracting structured information about real-world entities from the unstructured text available on the Web. In relation extraction, we focus on binary relations, that is, relations between two entities. Recently, many supervised models have been proposed to solve this task, but they mostly use noisy training data obtained using the distant supervision method. In many cases, evaluation of the models is also done based on a noisy test dataset. The lack of clean annotated datasets is a key challenge in this area of research. In this work, we built a web-based tool where researchers can annotate datasets for relation extraction on their own very easily. We use a server-less architecture for this tool, and the entire annotation operation is processed using client-side code. Thus it does not suffer from any network latency, and the privacy of the user's data is also maintained. We hope that this tool will be beneficial for researchers to advance the field of relation extraction.
Submitted 18 August, 2021;
originally announced August 2021.
-
Reproducibility Report: Contextualizing Hate Speech Classifiers with Post-hoc Explanation
Authors:
Kiran Purohit,
Owais Iqbal,
Ankan Mullick
Abstract:
The presented report evaluates Contextualizing Hate Speech Classifiers with Post-hoc Explanation paper within the scope of ML Reproducibility Challenge 2020. Our work focuses on both aspects constituting the paper: the method itself and the validity of the stated results. In the following sections, we have described the paper, related works, algorithmic frameworks, our experiments and evaluations.
Submitted 24 May, 2021;
originally announced May 2021.
-
MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature
Authors:
Souradip Guha,
Ankan Mullick,
Jatin Agrawal,
Swetarekha Ram,
Samir Ghui,
Seung-Cheol Lee,
Satadeep Bhattacharjee,
Pawan Goyal
Abstract:
The number of published articles in the field of materials science is growing rapidly every year. This comparatively unstructured data source, which contains a large amount of information, has a restriction on its re-usability, as the information needed to carry out further calculations using the data in it must be extracted manually. It is very important to obtain valid and contextually correct information from the online (offline) data, as it can be useful not only to generate inputs for further calculations, but also to incorporate them into a querying framework. Retaining this context as a priority, we have developed an automated tool, MatScIE (Material Science Information Extractor) that can extract relevant information from material science literature and make a structured database that is much easier to use for material simulations. Specifically, we extract the material details, methods, code, parameters, and structure from the various research articles. Finally, we created a web application where users can upload published articles and view/download the information obtained from this tool and can create their own databases for their personal uses.
Submitted 22 January, 2021; v1 submitted 14 September, 2020;
originally announced September 2020.
-
Public Sphere 2.0: Targeted Commenting in Online News Media
Authors:
Ankan Mullick,
Sayan Ghosh,
Ritam Dutt,
Avijit Ghosh,
Abhijnan Chakraborty
Abstract:
With the increase in online news consumption, to maximize advertisement revenue, news media websites try to attract and retain their readers on their sites. One of the most effective tools for reader engagement is commenting, where news readers post their views as comments against the news articles. Traditionally, it has been assumed that the comments are mostly made against the full article. In this work, we show that present commenting landscape is far from this assumption. Because the readers lack the time to go over an entire article, most of the comments are relevant to only particular sections of an article. In this paper, we build a system which can automatically classify comments against relevant sections of an article. To implement that, we develop a deep neural network based mechanism to find comments relevant to any section and a paragraph wise commenting interface to showcase them. We believe that such a data driven commenting system can help news websites to further increase reader engagement.
Submitted 21 February, 2019;
originally announced February 2019.
-
Understanding Psycholinguistic Behavior of predominant drunk texters in Social Media
Authors:
Suman Kalyan Maity,
Ankan Mullick,
Surjya Ghosh,
Anil Kumar,
Sunny Dhamnani,
Sudhanshu Bahety,
Animesh Mukherjee
Abstract:
In the last decade, social media has evolved as one of the leading platforms to create, share, or exchange information; it is commonly used as a way for individuals to maintain social connections. In this online digital world, people post texts or pictures to express their views socially and create user-user engagement through discussions and conversations. Thus, social media has established itself to bear signals relating to human behavior. One can easily design a user characteristic network by scraping through someone's social media profiles. In this paper, we investigate the potential of social media in characterizing and understanding predominant drunk texters from the perspective of their social, psychological and linguistic behavior as evident from the content generated by them. Our research aims to analyze the behavior of drunk texters on social media and to contrast this with non-drunk texters. We use Twitter social media to obtain the set of drunk texters and non-drunk texters and show that we can classify users into these two respective sets using various psycholinguistic features with an overall average accuracy of 96.78% with very high precision and recall. Note that such an automatic classification can have far-reaching impact - (i) on health research related to addiction prevention and control, and (ii) in eliminating abusive and vulgar contents from Twitter, borne by the tweets of drunk texters.
Submitted 28 May, 2018;
originally announced May 2018.
-
Understanding Book Popularity on Goodreads
Authors:
Suman Kalyan Maity,
Ayush Kumar,
Ankan Mullick,
Vishnu Choudhary,
Animesh Mukherjee
Abstract:
Goodreads has run the Readers Choice Awards since 2009, where users are able to nominate and vote for books of their choice released in the given year. In this work, we question if the number of votes that a book would receive (aka the popularity of the book) can be predicted based on the characteristics of various entities on Goodreads. We are successful in predicting the popularity of the books with high prediction accuracy (correlation coefficient ~0.61) and low RMSE (~1.25). User engagement and author's prestige are found to be crucial factors for book popularity.
Submitted 14 February, 2018;
originally announced February 2018.