-
Factual Dialogue Summarization via Learning from Large Language Models
Authors:
Rongxin Zhu,
Jey Han Lau,
Jianzhong Qi
Abstract:
Factual consistency is an important quality in dialogue summarization. Large language model (LLM)-based automatic text summarization models generate more factually consistent summaries compared to those by smaller pretrained language models, but they face deployment challenges in real-world applications due to privacy or resource constraints. In this paper, we investigate the use of symbolic knowl…
▽ More
Factual consistency is an important quality in dialogue summarization. Large language model (LLM)-based automatic text summarization models generate more factually consistent summaries compared to those by smaller pretrained language models, but they face deployment challenges in real-world applications due to privacy or resource constraints. In this paper, we investigate the use of symbolic knowledge distillation to improve the factual consistency of smaller pretrained models for dialogue summarization. We employ zero-shot learning to extract symbolic knowledge from LLMs, generating both factually consistent (positive) and inconsistent (negative) summaries. We then apply two contrastive learning objectives on these summaries to enhance smaller summarization models. Experiments with BART, PEGASUS, and Flan-T5 indicate that our approach surpasses strong baselines that rely on complex data augmentation strategies. Our approach achieves better factual consistency while maintaining coherence, fluency, and relevance, as confirmed by various automatic evaluation metrics. We also provide access to the data and code to facilitate future research.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Evaluating Transparency of Machine Generated Fact Checking Explanations
Authors:
Rui Xing,
Timothy Baldwin,
Jey Han Lau
Abstract:
An important factor when it comes to generating fact-checking explanations is the selection of evidence: intuitively, high-quality explanations can only be generated given the right evidence. In this work, we investigate the impact of human-curated vs. machine-selected evidence for explanation generation using large language models. To assess the quality of explanations, we focus on transparency (…
▽ More
An important factor when it comes to generating fact-checking explanations is the selection of evidence: intuitively, high-quality explanations can only be generated given the right evidence. In this work, we investigate the impact of human-curated vs. machine-selected evidence for explanation generation using large language models. To assess the quality of explanations, we focus on transparency (whether an explanation cites sources properly) and utility (whether an explanation is helpful in clarifying a claim). Surprisingly, we found that large language models generate similar or higher quality explanations using machine-selected evidence, suggesting carefully curated evidence (by humans) may not be necessary. That said, even with the best model, the generated explanations are not always faithful to the sources, suggesting further room for improvement in explanation generation for fact-checking.
△ Less
Submitted 18 June, 2024;
originally announced June 2024.
-
A Sentiment Consolidation Framework for Meta-Review Generation
Authors:
Miao Li,
Jey Han Lau,
Eduard Hovy
Abstract:
Modern natural language generation systems with Large Language Models (LLMs) exhibit the capability to generate a plausible summary of multiple documents; however, it is uncertain if they truly possess the capability of information consolidation to generate summaries, especially on documents with opinionated information. We focus on meta-review generation, a form of sentiment summarisation for the…
▽ More
Modern natural language generation systems with Large Language Models (LLMs) exhibit the capability to generate a plausible summary of multiple documents; however, it is uncertain if they truly possess the capability of information consolidation to generate summaries, especially on documents with opinionated information. We focus on meta-review generation, a form of sentiment summarisation for the scientific domain. To make scientific sentiment summarization more grounded, we hypothesize that human meta-reviewers follow a three-layer framework of sentiment consolidation to write meta-reviews. Based on the framework, we propose novel prompting methods for LLMs to generate meta-reviews and evaluation metrics to assess the quality of generated meta-reviews. Our framework is validated empirically as we find that prompting LLMs based on the framework -- compared with prompting them with simple instructions -- generates better meta-reviews.
△ Less
Submitted 4 June, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
CMA-R:Causal Mediation Analysis for Explaining Rumour Detection
Authors:
Lin Tian,
Xiuzhen Zhang,
Jey Han Lau
Abstract:
We apply causal mediation analysis to explain the decision-making process of neural models for rumour detection on Twitter. Interventions at the input and network level reveal the causal impacts of tweets and words in the model output. We find that our approach CMA-R -- Causal Mediation Analysis for Rumour detection -- identifies salient tweets that explain model predictions and show strong agreem…
▽ More
We apply causal mediation analysis to explain the decision-making process of neural models for rumour detection on Twitter. Interventions at the input and network level reveal the causal impacts of tweets and words in the model output. We find that our approach CMA-R -- Causal Mediation Analysis for Rumour detection -- identifies salient tweets that explain model predictions and show strong agreement with human judgements for critical tweets determining the truthfulness of stories. CMA-R can further highlight causally impactful words in the salient tweets, providing another layer of interpretability and transparency into these blackbox rumour detection systems. Code is available at: https://github.com/ltian678/cma-r.
△ Less
Submitted 12 February, 2024;
originally announced February 2024.
-
Unsupervised Lexical Simplification with Context Augmentation
Authors:
Takashi Wada,
Timothy Baldwin,
Jey Han Lau
Abstract:
We propose a new unsupervised lexical simplification method that uses only monolingual data and pre-trained language models. Given a target word and its context, our method generates substitutes based on the target context and also additional contexts sampled from monolingual data. We conduct experiments in English, Portuguese, and Spanish on the TSAR-2022 shared task, and show that our model subs…
▽ More
We propose a new unsupervised lexical simplification method that uses only monolingual data and pre-trained language models. Given a target word and its context, our method generates substitutes based on the target context and also additional contexts sampled from monolingual data. We conduct experiments in English, Portuguese, and Spanish on the TSAR-2022 shared task, and show that our model substantially outperforms other unsupervised systems across all languages. We also establish a new state-of-the-art by ensembling our model with GPT-3.5. Lastly, we evaluate our model on the SWORDS lexical substitution data set, achieving a state-of-the-art result.
△ Less
Submitted 1 November, 2023;
originally announced November 2023.
-
Unsupervised Paraphrasing of Multiword Expressions
Authors:
Takashi Wada,
Yuji Matsumoto,
Timothy Baldwin,
Jey Han Lau
Abstract:
We propose an unsupervised approach to paraphrasing multiword expressions (MWEs) in context. Our model employs only monolingual corpus data and pre-trained language models (without fine-tuning), and does not make use of any external resources such as dictionaries. We evaluate our method on the SemEval 2022 idiomatic semantic text similarity task, and show that it outperforms all unsupervised syste…
▽ More
We propose an unsupervised approach to paraphrasing multiword expressions (MWEs) in context. Our model employs only monolingual corpus data and pre-trained language models (without fine-tuning), and does not make use of any external resources such as dictionaries. We evaluate our method on the SemEval 2022 idiomatic semantic text similarity task, and show that it outperforms all unsupervised systems and rivals supervised systems.
△ Less
Submitted 2 June, 2023;
originally announced June 2023.
-
Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization
Authors:
Rongxin Zhu,
Jianzhong Qi,
Jey Han Lau
Abstract:
A series of datasets and models have been proposed for summaries generated for well-formatted documents such as news articles. Dialogue summaries, however, have been under explored. In this paper, we present the first dataset with fine-grained factual error annotations named DIASUMFACT. We define fine-grained factual error detection as a sentence-level multi-label classification problem, and we ev…
▽ More
A series of datasets and models have been proposed for summaries generated for well-formatted documents such as news articles. Dialogue summaries, however, have been under explored. In this paper, we present the first dataset with fine-grained factual error annotations named DIASUMFACT. We define fine-grained factual error detection as a sentence-level multi-label classification problem, and we evaluate two state-of-the-art (SOTA) models on our dataset. Both models yield sub-optimal results, with a macro-averaged F1 score of around 0.25 over 6 error classes. We further propose an unsupervised model ENDERANKER via candidate ranking using pretrained encoder-decoder models. Our model performs on par with the SOTA models while requiring fewer resources. These observations confirm the challenges in detecting factual errors from dialogue summaries, which call for further studies, for which our dataset and results offer a solid foundation.
△ Less
Submitted 25 May, 2023;
originally announced May 2023.
-
Summarizing Multiple Documents with Conversational Structure for Meta-Review Generation
Authors:
Miao Li,
Eduard Hovy,
Jey Han Lau
Abstract:
We present PeerSum, a novel dataset for generating meta-reviews of scientific papers. The meta-reviews can be interpreted as abstractive summaries of reviews, multi-turn discussions and the paper abstract. These source documents have rich inter-document relationships with an explicit hierarchical conversational structure, cross-references and (occasionally) conflicting information. To introduce th…
▽ More
We present PeerSum, a novel dataset for generating meta-reviews of scientific papers. The meta-reviews can be interpreted as abstractive summaries of reviews, multi-turn discussions and the paper abstract. These source documents have rich inter-document relationships with an explicit hierarchical conversational structure, cross-references and (occasionally) conflicting information. To introduce the structural inductive bias into pre-trained language models, we introduce Rammer ( Relationship-aware Multi-task Meta-review Generator), a model that uses sparse attention based on the conversational structure and a multi-task training objective that predicts metadata features (e.g., review ratings). Our experimental results show that Rammer outperforms other strong baseline models in terms of a suite of automatic evaluation metrics. Further analyses, however, reveal that RAMMER and other models struggle to handle conflicts in source documents of PeerSum, suggesting meta-review generation is a challenging task and a promising avenue for further research.
△ Less
Submitted 23 October, 2023; v1 submitted 2 May, 2023;
originally announced May 2023.
-
DeltaScore: Fine-Grained Story Evaluation with Perturbations
Authors:
Zhuohan Xie,
Miao Li,
Trevor Cohn,
Jey Han Lau
Abstract:
Numerous evaluation metrics have been developed for natural language generation tasks, but their effectiveness in evaluating stories is limited as they are not specifically tailored to assess intricate aspects of storytelling, such as fluency and interestingness. In this paper, we introduce DELTASCORE, a novel methodology that employs perturbation techniques for the evaluation of nuanced story asp…
▽ More
Numerous evaluation metrics have been developed for natural language generation tasks, but their effectiveness in evaluating stories is limited as they are not specifically tailored to assess intricate aspects of storytelling, such as fluency and interestingness. In this paper, we introduce DELTASCORE, a novel methodology that employs perturbation techniques for the evaluation of nuanced story aspects. Our central proposition posits that the extent to which a story excels in a specific aspect (e.g., fluency) correlates with the magnitude of its susceptibility to particular perturbations (e.g., the introduction of typos). Given this, we measure the quality of an aspect by calculating the likelihood difference between pre- and post-perturbation states using pre-trained language models. We compare DELTASCORE with existing metrics on storytelling datasets from two domains in five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. DELTASCORE demonstrates remarkable performance, revealing a surprising finding that a specific perturbation proves highly effective in capturing multiple aspects.
△ Less
Submitted 2 November, 2023; v1 submitted 15 March, 2023;
originally announced March 2023.
-
MetaTroll: Few-shot Detection of State-Sponsored Trolls with Transformer Adapters
Authors:
Lin Tian,
Xiuzhen Zhang,
Jey Han Lau
Abstract:
State-sponsored trolls are the main actors of influence campaigns on social media and automatic troll detection is important to combat misinformation at scale. Existing troll detection models are developed based on training data for known campaigns (e.g.\ the influence campaign by Russia's Internet Research Agency on the 2016 US Election), and they fall short when dealing with {\em novel} campaign…
▽ More
State-sponsored trolls are the main actors of influence campaigns on social media and automatic troll detection is important to combat misinformation at scale. Existing troll detection models are developed based on training data for known campaigns (e.g.\ the influence campaign by Russia's Internet Research Agency on the 2016 US Election), and they fall short when dealing with {\em novel} campaigns with new targets. We propose MetaTroll, a text-based troll detection model based on the meta-learning framework that enables high portability and parameter-efficient adaptation to new campaigns using only a handful of labelled samples for few-shot transfer. We introduce \textit{campaign-specific} transformer adapters to MetaTroll to ``memorise'' campaign-specific knowledge so as to tackle catastrophic forgetting, where a model ``forgets'' how to detect trolls from older campaigns due to continual adaptation. Our experiments demonstrate that MetaTroll substantially outperforms baselines and state-of-the-art few-shot text classification models. Lastly, we explore simple approaches to extend MetaTroll to multilingual and multimodal detection. Source code for MetaTroll is available at: https://github.com/ltian678/metatroll-code.git.
△ Less
Submitted 13 March, 2023;
originally announced March 2023.
-
Compressed Heterogeneous Graph for Abstractive Multi-Document Summarization
Authors:
Miao Li,
Jianzhong Qi,
Jey Han Lau
Abstract:
Multi-document summarization (MDS) aims to generate a summary for a number of related documents. We propose HGSUM, an MDS model that extends an encoder-decoder architecture, to incorporate a heterogeneous graph to represent different semantic units (e.g., words and sentences) of the documents. This contrasts with existing MDS models which do not consider different edge types of graphs and as such…
▽ More
Multi-document summarization (MDS) aims to generate a summary for a number of related documents. We propose HGSUM, an MDS model that extends an encoder-decoder architecture, to incorporate a heterogeneous graph to represent different semantic units (e.g., words and sentences) of the documents. This contrasts with existing MDS models which do not consider different edge types of graphs and as such do not capture the diversity of relationships in the documents. To preserve only key information and relationships of the documents in the heterogeneous graph, HGSUM uses graph pooling to compress the input graph. And to guide HGSUM to learn compression, we introduce an additional objective that maximizes the similarity between the compressed graph and the graph constructed from the ground-truth summary during training. HGSUM is trained end-to-end with graph similarity and standard cross-entropy objectives. Experimental results over MULTI-NEWS, WCEP-100, and ARXIV show that HGSUM outperforms state-of-the-art MDS models. The code for our model and experiments is available at: https://github.com/oaimli/HGSum.
△ Less
Submitted 11 March, 2023;
originally announced March 2023.
-
The Next Chapter: A Study of Large Language Models in Storytelling
Authors:
Zhuohan Xie,
Trevor Cohn,
Jey Han Lau
Abstract:
To enhance the quality of generated stories, recent story generation models have been investigating the utilization of higher-level attributes like plots or commonsense knowledge. The application of prompt-based learning with large language models (LLMs), exemplified by GPT-3, has exhibited remarkable performance in diverse natural language processing (NLP) tasks. This paper conducts a comprehensi…
▽ More
To enhance the quality of generated stories, recent story generation models have been investigating the utilization of higher-level attributes like plots or commonsense knowledge. The application of prompt-based learning with large language models (LLMs), exemplified by GPT-3, has exhibited remarkable performance in diverse natural language processing (NLP) tasks. This paper conducts a comprehensive investigation, utilizing both automatic and human evaluation, to compare the story generation capacity of LLMs with recent models across three datasets with variations in style, register, and length of stories. The results demonstrate that LLMs generate stories of significantly higher quality compared to other story generation models. Moreover, they exhibit a level of performance that competes with human authors, albeit with the preliminary observation that they tend to replicate real stories in situations involving world knowledge, resembling a form of plagiarism.
△ Less
Submitted 24 July, 2023; v1 submitted 23 January, 2023;
originally announced January 2023.
-
Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation
Authors:
Thinh Hung Truong,
Yulia Otmakhova,
Timothy Baldwin,
Trevor Cohn,
Jey Han Lau,
Karin Verspoor
Abstract:
Negation is poorly captured by current language models, although the extent of this problem is not widely understood. We introduce a natural language inference (NLI) test suite to enable probing the capabilities of NLP methods, with the aim of understanding sub-clausal negation. The test suite contains premise--hypothesis pairs where the premise contains sub-clausal negation and the hypothesis is…
▽ More
Negation is poorly captured by current language models, although the extent of this problem is not widely understood. We introduce a natural language inference (NLI) test suite to enable probing the capabilities of NLP methods, with the aim of understanding sub-clausal negation. The test suite contains premise--hypothesis pairs where the premise contains sub-clausal negation and the hypothesis is constructed by making minimal modifications to the premise in order to reflect different possible interpretations. Aside from adopting standard NLI labels, our test suite is systematically constructed under a rigorous linguistic framework. It includes annotation of negation types and constructions grounded in linguistic theory, as well as the operations used to construct hypotheses. This facilitates fine-grained analysis of model performance. We conduct experiments using pre-trained language models to demonstrate that our test suite is more challenging than existing benchmarks focused on negation, and show how our annotation supports a deeper understanding of the current NLI capabilities in terms of negation and quantification.
△ Less
Submitted 13 October, 2022; v1 submitted 6 October, 2022;
originally announced October 2022.
-
Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective
Authors:
Zijian Zhang,
Chang Shu,
Ya Xiao,
Yuan Shen,
Di Zhu,
**g Xiao,
Youxin Chen,
Jey Han Lau,
Qian Zhang,
Zheng Lu
Abstract:
Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; a…
▽ More
Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; and (2) only considering the most difficult-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy to dynamically select a group of negative samples to make the optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE using our pooling and optimization strategies outperforms current state-of-the-art systems (at least 1.0% on the metrics of recall) in image-to-text and text-to-image retrieval. Source code of our experiments is available at https://github.com/96-Zachary/vse_2ad.
△ Less
Submitted 5 October, 2022;
originally announced October 2022.
-
LED down the rabbit hole: exploring the potential of global attention for biomedical multi-document summarisation
Authors:
Yulia Otmakhova,
Hung Thinh Truong,
Timothy Baldwin,
Trevor Cohn,
Karin Verspoor,
Jey Han Lau
Abstract:
In this paper we report on our submission to the Multidocument Summarisation for Literature Review (MSLR) shared task. Specifically, we adapt PRIMERA (Xiao et al., 2022) to the biomedical domain by placing global attention on important biomedical entities in several ways. We analyse the outputs of the 23 resulting models, and report patterns in the results related to the presence of additional glo…
▽ More
In this paper we report on our submission to the Multidocument Summarisation for Literature Review (MSLR) shared task. Specifically, we adapt PRIMERA (Xiao et al., 2022) to the biomedical domain by placing global attention on important biomedical entities in several ways. We analyse the outputs of the 23 resulting models, and report patterns in the results related to the presence of additional global attention, number of training steps, and the input configuration.
△ Less
Submitted 18 September, 2022;
originally announced September 2022.
-
Unsupervised Lexical Substitution with Decontextualised Embeddings
Authors:
Takashi Wada,
Timothy Baldwin,
Yuji Matsumoto,
Jey Han Lau
Abstract:
We propose a new unsupervised method for lexical substitution using pre-trained language models. Compared to previous approaches that use the generative capability of language models to predict substitutes, our method retrieves substitutes based on the similarity of contextualised and decontextualised word embeddings, i.e. the average contextual representation of a word in multiple contexts. We co…
▽ More
We propose a new unsupervised method for lexical substitution using pre-trained language models. Compared to previous approaches that use the generative capability of language models to predict substitutes, our method retrieves substitutes based on the similarity of contextualised and decontextualised word embeddings, i.e. the average contextual representation of a word in multiple contexts. We conduct experiments in English and Italian, and show that our method substantially outperforms strong baselines and establishes a new state-of-the-art without any explicit supervision or fine-tuning. We further show that our method performs particularly well at predicting low-frequency substitutes, and also generates a diverse list of substitute candidates, reducing morphophonetic or morphosyntactic biases induced by article-noun agreement.
△ Less
Submitted 16 September, 2022;
originally announced September 2022.
-
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Authors:
Genta Indra Winata,
Alham Fikri Aji,
Samuel Cahyawijaya,
Rahmad Mahendra,
Fajri Koto,
Ade Romadhony,
Kemal Kurniawan,
David Moeljadi,
Radityo Eko Prasojo,
Pascale Fung,
Timothy Baldwin,
Jey Han Lau,
Rico Sennrich,
Sebastian Ruder
Abstract:
Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on develo** re…
▽ More
Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on develo** resources for languages in Indonesia. Despite being the second most linguistically diverse country, most languages in Indonesia are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges when creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.
△ Less
Submitted 12 April, 2023; v1 submitted 31 May, 2022;
originally announced May 2022.
-
Robust Task-Oriented Dialogue Generation with Contrastive Pre-training and Adversarial Filtering
Authors:
Shiquan Yang,
Xinting Huang,
Jey Han Lau,
Sarah Erfani
Abstract:
Data artifacts incentivize machine learning models to learn non-transferable generalizations by taking advantage of shortcuts in the data, and there is growing evidence that data artifacts play a role for the strong results that deep learning models achieve in recent natural language processing benchmarks. In this paper, we focus on task-oriented dialogue and investigate whether popular datasets s…
▽ More
Data artifacts incentivize machine learning models to learn non-transferable generalizations by taking advantage of shortcuts in the data, and there is growing evidence that data artifacts play a role for the strong results that deep learning models achieve in recent natural language processing benchmarks. In this paper, we focus on task-oriented dialogue and investigate whether popular datasets such as MultiWOZ contain such data artifacts. We found that by only kee** frequent phrases in the training examples, state-of-the-art models perform similarly compared to the variant trained with full data, suggesting they exploit these spurious correlations to solve the task. Motivated by this, we propose a contrastive learning based framework to encourage the model to ignore these cues and focus on learning generalisable patterns. We also experiment with adversarial filtering to remove "easy" training instances so that the model would focus on learning from the "harder" instances. We conduct a number of generalization experiments -- e.g., cross-domain/dataset and adversarial tests -- to assess the robustness of our approach and found that it works exceptionally well.
△ Less
Submitted 19 May, 2022;
originally announced May 2022.
-
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
Authors:
Alham Fikri Aji,
Genta Indra Winata,
Fajri Koto,
Samuel Cahyawijaya,
Ade Romadhony,
Rahmad Mahendra,
Kemal Kurniawan,
David Moeljadi,
Radityo Eko Prasojo,
Timothy Baldwin,
Jey Han Lau,
Sebastian Ruder
Abstract:
NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian N…
▽ More
NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.
△ Less
Submitted 24 March, 2022;
originally announced March 2022.
-
An Interpretable Neuro-Symbolic Reasoning Framework for Task-Oriented Dialogue Generation
Authors:
Shiquan Yang,
Rui Zhang,
Sarah Erfani,
Jey Han Lau
Abstract:
We study the interpretability issue of task-oriented dialogue systems in this paper. Previously, most neural-based task-oriented dialogue systems employ an implicit reasoning strategy that makes the model predictions uninterpretable to humans. To obtain a transparent reasoning process, we introduce neuro-symbolic to perform explicit reasoning that justifies model decisions by reasoning chains. Sin…
▽ More
We study the interpretability issue of task-oriented dialogue systems in this paper. Previously, most neural-based task-oriented dialogue systems employ an implicit reasoning strategy that makes the model predictions uninterpretable to humans. To obtain a transparent reasoning process, we introduce neuro-symbolic to perform explicit reasoning that justifies model decisions by reasoning chains. Since deriving reasoning chains requires multi-hop reasoning for task-oriented dialogues, existing neuro-symbolic approaches would induce error propagation due to the one-phase design. To overcome this, we propose a two-phase approach that consists of a hypothesis generator and a reasoner. We first obtain multiple hypotheses, i.e., potential operations to perform the desired task, through the hypothesis generator. Each hypothesis is then verified by the reasoner, and the valid one is selected to conduct the final prediction. The whole system is trained by exploiting raw textual dialogues without using any reasoning chain annotations. Experimental studies on two public benchmark datasets demonstrate that the proposed approach not only achieves better results, but also introduces an interpretable decision process.
△ Less
Submitted 11 March, 2022;
originally announced March 2022.
-
PeerSum: A Peer Review Dataset for Abstractive Multi-document Summarization
Authors:
Miao Li,
Jianzhong Qi,
Jey Han Lau
Abstract:
We present PeerSum, a new MDS dataset using peer reviews of scientific publications. Our dataset differs from the existing MDS datasets in that our summaries (i.e., the meta-reviews) are highly abstractive and they are real summaries of the source documents (i.e., the reviews) and it also features disagreements among source documents. We found that current state-of-the-art MDS models struggle to g…
▽ More
We present PeerSum, a new MDS dataset using peer reviews of scientific publications. Our dataset differs from the existing MDS datasets in that our summaries (i.e., the meta-reviews) are highly abstractive and they are real summaries of the source documents (i.e., the reviews) and it also features disagreements among source documents. We found that current state-of-the-art MDS models struggle to generate high-quality summaries for PeerSum, offering new research opportunities.
△ Less
Submitted 28 September, 2022; v1 submitted 3 March, 2022;
originally announced March 2022.
-
ITTC @ TREC 2021 Clinical Trials Track
Authors:
Thinh Hung Truong,
Yulia Otmakhova,
Rahmad Mahendra,
Timothy Baldwin,
Jey Han Lau,
Trevor Cohn,
Lawrence Cavedon,
Damiano Spina,
Karin Verspoor
Abstract:
This paper describes the submissions of the Natural Language Processing (NLP) team from the Australian Research Council Industrial Transformation Training Centre (ITTC) for Cognitive Computing in Medical Technologies to the TREC 2021 Clinical Trials Track. The task focuses on the problem of matching eligible clinical trials to topics constituting a summary of a patient's admission notes. We explor…
▽ More
This paper describes the submissions of the Natural Language Processing (NLP) team from the Australian Research Council Industrial Transformation Training Centre (ITTC) for Cognitive Computing in Medical Technologies to the TREC 2021 Clinical Trials Track. The task focuses on the problem of matching eligible clinical trials to topics constituting a summary of a patient's admission notes. We explore different ways of representing trials and topics using NLP techniques, and then use a common retrieval model to generate the ranked list of relevant trials for each topic. The results from all our submitted runs are well above the median scores for all topics, but there is still plenty of scope for improvement.
△ Less
Submitted 15 February, 2022;
originally announced February 2022.
-
Findings on Conversation Disentanglement
Authors:
Rongxin Zhu,
Jey Han Lau,
Jianzhong Qi
Abstract:
Conversation disentanglement, the task to identify separate threads in conversations, is an important pre-processing step in multi-party conversational NLP applications such as conversational question answering and conversation summarization. Framing it as a utterance-to-utterance classification problem -- i.e. given an utterance of interest (UOI), find which past utterance it replies to -- we exp…
▽ More
Conversation disentanglement, the task to identify separate threads in conversations, is an important pre-processing step in multi-party conversational NLP applications such as conversational question answering and conversation summarization. Framing it as a utterance-to-utterance classification problem -- i.e. given an utterance of interest (UOI), find which past utterance it replies to -- we explore a number of transformer-based models and found that BERT in combination with handcrafted features remains a strong baseline. We then build a multi-task learning model that jointly learns utterance-to-utterance and utterance-to-thread classification. Observing that the ground truth label (past utterance) is in the top candidates when our model makes an error, we experiment with using bipartite graphs as a post-processing step to learn how to best match a set of UOIs to past utterances. Experiments on the Ubuntu IRC dataset show that this approach has the potential to outperform the conventional greedy approach of simply selecting the highest probability candidate for each UOI independently, indicating a promising future research direction.
△ Less
Submitted 10 December, 2021;
originally announced December 2021.
-
Exploring Story Generation with Multi-task Objectives in Variational Autoencoders
Authors:
Zhuohan Xie,
Trevor Cohn,
Jey Han Lau
Abstract:
GPT-2 has been frequently adapted in story generation models as it provides powerful generative capability. However, it still fails to generate consistent stories and lacks diversity. Current story generation models leverage additional information such as plots or commonsense into GPT-2 to guide the generation process. These approaches focus on improving generation quality of stories while our wor…
▽ More
GPT-2 has been frequently adapted in story generation models as it provides powerful generative capability. However, it still fails to generate consistent stories and lacks diversity. Current story generation models leverage additional information such as plots or commonsense into GPT-2 to guide the generation process. These approaches focus on improving generation quality of stories while our work look at both quality and diversity. We explore combining BERT and GPT-2 to build a variational autoencoder (VAE), and extend it by adding additional objectives to learn global features such as story topic and discourse relations. Our evaluations show our enhanced VAE can provide better quality and diversity trade off, generate less repetitive story content and learn a more informative latent variable.
△ Less
Submitted 15 November, 2021;
originally announced November 2021.
-
Rumour Detection via Zero-shot Cross-lingual Transfer Learning
Authors:
Lin Tian,
Xiuzhen Zhang,
Jey Han Lau
Abstract:
Most rumour detection models for social media are designed for one specific language (mostly English). There are over 40 languages on Twitter and most languages lack annotated resources to build rumour detection models. In this paper we propose a zero-shot cross-lingual transfer learning framework that can adapt a rumour detection model trained for a source language to another target language. Our…
▽ More
Most rumour detection models for social media are designed for one specific language (mostly English). There are over 40 languages on Twitter and most languages lack annotated resources to build rumour detection models. In this paper we propose a zero-shot cross-lingual transfer learning framework that can adapt a rumour detection model trained for a source language to another target language. Our framework utilises pretrained multilingual language models (e.g.\ multilingual BERT) and a self-training loop to iteratively bootstrap the creation of ''silver labels'' in the target language to adapt the model from the source language to the target language. We evaluate our methodology on English and Chinese rumour datasets and demonstrate that our model substantially outperforms competitive benchmarks in both source and target language rumour detection.
△ Less
Submitted 26 September, 2021;
originally announced September 2021.
-
IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization
Authors:
Fajri Koto,
Jey Han Lau,
Timothy Baldwin
Abstract:
We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing…
▽ More
We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing with the average BERT subword embedding makes pretraining five times faster, and is more effective than proposed methods for vocabulary adaptation in terms of extrinsic evaluation over seven Twitter-based datasets.
△ Less
Submitted 9 September, 2021;
originally announced September 2021.
-
Automatic Claim Review for Climate Science via Explanation Generation
Authors:
Shraey Bhatia,
Jey Han Lau,
Timothy Baldwin
Abstract:
There is unison is the scientific community about human induced climate change. Despite this, we see the web awash with claims around climate change scepticism, thus driving the need for fact checking them but at the same time providing an explanation and justification for the fact check. Scientists and experts have been trying to address it by providing manually written feedback for these claims.…
▽ More
There is unison is the scientific community about human induced climate change. Despite this, we see the web awash with claims around climate change scepticism, thus driving the need for fact checking them but at the same time providing an explanation and justification for the fact check. Scientists and experts have been trying to address it by providing manually written feedback for these claims. In this paper, we try to aid them by automating generating explanation for a predicted veracity label for a claim by deploying the approach used in open domain question answering of a fusion in decoder augmented with retrieved supporting passages from an external knowledge. We experiment with different knowledge sources, retrievers, retriever depths and demonstrate that even a small number of high quality manually written explanations can help us in generating good explanations.
△ Less
Submitted 30 July, 2021;
originally announced July 2021.
-
Evaluating the Efficacy of Summarization Evaluation across Languages
Authors:
Fajri Koto,
Jey Han Lau,
Timothy Baldwin
Abstract:
While automatic summarization evaluation methods developed for English are routinely applied to other languages, this is the first attempt to systematically quantify their panlinguistic efficacy. We take a summarization corpus for eight different languages, and manually annotate generated summaries for focus (precision) and coverage (recall). Based on this, we evaluate 19 summarization evaluation…
▽ More
While automatic summarization evaluation methods developed for English are routinely applied to other languages, this is the first attempt to systematically quantify their panlinguistic efficacy. We take a summarization corpus for eight different languages, and manually annotate generated summaries for focus (precision) and coverage (recall). Based on this, we evaluate 19 summarization evaluation metrics, and find that using multilingual BERT within BERTScore performs well across all languages, at a level above that for English.
△ Less
Submitted 2 June, 2021;
originally announced June 2021.
-
Impact of detecting clinical trial elements in exploration of COVID-19 literature
Authors:
Simon Ĺ uster,
Karin Verspoor,
Timothy Baldwin,
Jey Han Lau,
Antonio Jimeno Yepes,
David Martinez,
Yulia Otmakhova
Abstract:
The COVID-19 pandemic has driven ever-greater demand for tools which enable efficient exploration of biomedical literature. Although semi-structured information resulting from concept recognition and detection of the defining elements of clinical trials (e.g. PICO criteria) has been commonly used to support literature search, the contributions of this abstraction remain poorly understood, especial…
▽ More
The COVID-19 pandemic has driven ever-greater demand for tools which enable efficient exploration of biomedical literature. Although semi-structured information resulting from concept recognition and detection of the defining elements of clinical trials (e.g. PICO criteria) has been commonly used to support literature search, the contributions of this abstraction remain poorly understood, especially in relation to text-based retrieval. In this study, we compare the results retrieved by a standard search engine with those filtered using clinically-relevant concepts and their relations. With analysis based on the annotations from the TREC-COVID shared task, we obtain quantitative as well as qualitative insights into characteristics of relational and concept-based literature exploration. Most importantly, we find that the relational concept selection filters the original retrieved collection in a way that decreases the proportion of unjudged documents and increases the precision, which means that the user is likely to be exposed to a larger number of relevant documents.
△ Less
Submitted 25 May, 2021;
originally announced May 2021.
-
Discourse Probing of Pretrained Language Models
Authors:
Fajri Koto,
Jey Han Lau,
Timothy Baldwin
Abstract:
Existing work on probing of pretrained language models (LMs) has predominantly focused on sentence-level syntactic tasks. In this paper, we introduce document-level discourse probing to evaluate the ability of pretrained LMs to capture document-level relations. We experiment with 7 pretrained LMs, 4 languages, and 7 discourse probing tasks, and find BART to be overall the best model at capturing d…
▽ More
Existing work on probing of pretrained language models (LMs) has predominantly focused on sentence-level syntactic tasks. In this paper, we introduce document-level discourse probing to evaluate the ability of pretrained LMs to capture document-level relations. We experiment with 7 pretrained LMs, 4 languages, and 7 discourse probing tasks, and find BART to be overall the best model at capturing discourse -- but only in its encoder, with BERT performing surprisingly well as the baseline model. Across the different models, there are substantial differences in which layers best capture discourse information, and large disparities between models.
△ Less
Submitted 12 April, 2021;
originally announced April 2021.
-
Grey-box Adversarial Attack And Defence For Sentiment Classification
Authors:
Ying Xu,
Xu Zhong,
Antonio Jimeno Yepes,
Jey Han Lau
Abstract:
We introduce a grey-box adversarial attack and defence framework for sentiment classification. We address the issues of differentiability, label preservation and input reconstruction for adversarial attack and defence in one unified framework. Our results show that once trained, the attacking model is capable of generating high-quality adversarial examples substantially faster (one order of magnit…
▽ More
We introduce a grey-box adversarial attack and defence framework for sentiment classification. We address the issues of differentiability, label preservation and input reconstruction for adversarial attack and defence in one unified framework. Our results show that once trained, the attacking model is capable of generating high-quality adversarial examples substantially faster (one order of magnitude less in time) than state-of-the-art attacking methods. These examples also preserve the original sentiment according to human evaluation. Additionally, our framework produces an improved classifier that is robust in defending against multiple adversarial attacking methods. Code is available at: https://github.com/ibm-aur-nlp/adv-def-text-dist.
△ Less
Submitted 22 March, 2021;
originally announced March 2021.
-
Top-down Discourse Parsing via Sequence Labelling
Authors:
Fajri Koto,
Jey Han Lau,
Timothy Baldwin
Abstract:
We introduce a top-down approach to discourse parsing that is conceptually simpler than its predecessors (Kobayashi et al., 2020; Zhang et al., 2020). By framing the task as a sequence labelling problem where the goal is to iteratively segment a document into individual discourse units, we are able to eliminate the decoder and reduce the search space for splitting points. We explore both tradition…
▽ More
We introduce a top-down approach to discourse parsing that is conceptually simpler than its predecessors (Kobayashi et al., 2020; Zhang et al., 2020). By framing the task as a sequence labelling problem where the goal is to iteratively segment a document into individual discourse units, we are able to eliminate the decoder and reduce the search space for splitting points. We explore both traditional recurrent models and modern pre-trained transformer models for the task, and additionally introduce a novel dynamic oracle for top-down parsing. Based on the Full metric, our proposed LSTM model sets a new state-of-the-art for RST parsing.
△ Less
Submitted 5 April, 2021; v1 submitted 3 February, 2021;
originally announced February 2021.
-
FFCI: A Framework for Interpretable Automatic Evaluation of Summarization
Authors:
Fajri Koto,
Timothy Baldwin,
Jey Han Lau
Abstract:
In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary content relative to the reference), coverage (recall of summary content relative to the reference), and inter-sentential coherence (document fluency between adjacent sentences). We construct a n…
▽ More
In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary content relative to the reference), coverage (recall of summary content relative to the reference), and inter-sentential coherence (document fluency between adjacent sentences). We construct a novel dataset for focus, coverage, and inter-sentential coherence, and develop automatic methods for evaluating each of the four dimensions of FFCI based on cross-comparison of evaluation metrics and model-based evaluation methods, including question answering (QA) approaches, semantic textual similarity (STS), next-sentence prediction (NSP), and scores derived from 19 pre-trained language models. We then apply the developed metrics in evaluating a broad range of summarization models across two datasets, with some surprising findings.
△ Less
Submitted 27 February, 2022; v1 submitted 27 November, 2020;
originally announced November 2020.
-
Liputan6: A Large-scale Indonesian Dataset for Text Summarization
Authors:
Fajri Koto,
Jey Han Lau,
Timothy Baldwin
Abstract:
In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by…
▽ More
In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by examining machine-generated summaries that have low ROUGE scores, and expose both issues with ROUGE it-self, as well as with extractive and abstractive summarization models.
△ Less
Submitted 1 November, 2020;
originally announced November 2020.
-
IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP
Authors:
Fajri Koto,
Afshin Rahimi,
Jey Han Lau,
Timothy Baldwin
Abstract:
Although the Indonesian language is spoken by almost 200 million people and the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian…
▽ More
Although the Indonesian language is spoken by almost 200 million people and the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM, in addition to benchmarking it against existing resources. Our experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.
△ Less
Submitted 1 November, 2020;
originally announced November 2020.
-
Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora
Authors:
Takashi Wada,
Tomoharu Iwata,
Yuji Matsumoto,
Timothy Baldwin,
Jey Han Lau
Abstract:
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a…
▽ More
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a common cross-lingual space. We also propose to combine word and subword embeddings to make use of orthographic similarities across different languages. We base our experiments on real-world data from endangered languages, namely Yongning Na, Shipibo-Konibo, and Griko. Our experiments on bilingual lexicon induction and word alignment tasks show that our model outperforms existing methods by a large margin for most language pairs. These results demonstrate that, contrary to common belief, an encoder-decoder translation model is beneficial for learning cross-lingual representations even in extremely low-resource conditions. Furthermore, our model also works well on high-resource conditions, achieving state-of-the-art performance on a German-English word-alignment task.
△ Less
Submitted 19 October, 2021; v1 submitted 27 October, 2020;
originally announced October 2020.
-
COVID-SEE: Scientific Evidence Explorer for COVID-19 Related Research
Authors:
Karin Verspoor,
Simon Ĺ uster,
Yulia Otmakhova,
Shevon Mendis,
Zenan Zhai,
Biaoyan Fang,
Jey Han Lau,
Timothy Baldwin,
Antonio Jimeno Yepes,
David Martinez
Abstract:
We present COVID-SEE, a system for medical literature discovery based on the concept of information exploration, which builds on several distinct text analysis and natural language processing methods to structure and organise information in publications, and augments search by providing a visual overview supporting exploration of a collection to identify key articles of interest. We developed this…
▽ More
We present COVID-SEE, a system for medical literature discovery based on the concept of information exploration, which builds on several distinct text analysis and natural language processing methods to structure and organise information in publications, and augments search by providing a visual overview supporting exploration of a collection to identify key articles of interest. We developed this system over COVID-19 literature to help medical professionals and researchers explore the literature evidence, and improve findability of relevant information. COVID-SEE is available at http://covid-see.com.
△ Less
Submitted 18 August, 2020;
originally announced August 2020.
-
Less is More: Rejecting Unreliable Reviews for Product Question Answering
Authors:
Shiwei Zhang,
Xiuzhen Zhang,
Jey Han Lau,
Jeffrey Chan,
Cecile Paris
Abstract:
Promptly and accurately answering questions on products is important for e-commerce applications. Manually answering product questions (e.g. on community question answering platforms) results in slow response and does not scale. Recent studies show that product reviews are a good source for real-time, automatic product question answering (PQA). In the literature, PQA is formulated as a retrieval p…
▽ More
Promptly and accurately answering questions on products is important for e-commerce applications. Manually answering product questions (e.g. on community question answering platforms) results in slow response and does not scale. Recent studies show that product reviews are a good source for real-time, automatic product question answering (PQA). In the literature, PQA is formulated as a retrieval problem with the goal to search for the most relevant reviews to answer a given product question. In this paper, we focus on the issue of answerability and answer reliability for PQA using reviews. Our investigation is based on the intuition that many questions may not be answerable with a finite set of reviews. When a question is not answerable, a system should return nil answers rather than providing a list of irrelevant reviews, which can have significant negative impact on user experience. Moreover, for answerable questions, only the most relevant reviews that answer the question should be included in the result. We propose a conformal prediction based framework to improve the reliability of PQA systems, where we reject unreliable answers so that the returned results are more concise and accurate at answering the product question, including returning nil answers for unanswerable questions. Experiments on a widely used Amazon dataset show encouraging results of our proposed framework. More broadly, our results demonstrate a novel and effective application of conformal methods to a retrieval task.
△ Less
Submitted 8 July, 2020;
originally announced July 2020.
-
Give Me Convenience and Give Her Death: Who Should Decide What Uses of NLP are Appropriate, and on What Basis?
Authors:
Kobi Leins,
Jey Han Lau,
Timothy Baldwin
Abstract:
As part of growing NLP capabilities, coupled with an awareness of the ethical dimensions of research, questions have been raised about whether particular datasets and tasks should be deemed off-limits for NLP research. We examine this question with respect to a paper on automatic legal sentencing from EMNLP 2019 which was a source of some debate, in asking whether the paper should have been allowe…
▽ More
As part of growing NLP capabilities, coupled with an awareness of the ethical dimensions of research, questions have been raised about whether particular datasets and tasks should be deemed off-limits for NLP research. We examine this question with respect to a paper on automatic legal sentencing from EMNLP 2019 which was a source of some debate, in asking whether the paper should have been allowed to be published, who should have been charged with making such a decision, and on what basis. We focus in particular on the role of data statements in ethically assessing research, but also discuss the topic of dual use, and examine the outcomes of similar debates in other scientific disciplines.
△ Less
Submitted 27 May, 2020;
originally announced May 2020.
-
You are right. I am ALARMED -- But by Climate Change Counter Movement
Authors:
Shraey Bhatia,
Jey Han Lau,
Timothy Baldwin
Abstract:
The world is facing the challenge of climate crisis. Despite the consensus in scientific community about anthropogenic global warming, the web is flooded with articles spreading climate misinformation. These articles are carefully constructed by climate change counter movement (cccm) organizations to influence the narrative around climate change. We revisit the literature on climate misinformation…
▽ More
The world is facing the challenge of climate crisis. Despite the consensus in scientific community about anthropogenic global warming, the web is flooded with articles spreading climate misinformation. These articles are carefully constructed by climate change counter movement (cccm) organizations to influence the narrative around climate change. We revisit the literature on climate misinformation in social sciences and repackage it to introduce in the community of NLP. Despite considerable work in detection of fake news, there is no misinformation dataset available that is specific to the domain.of climate change. We try to bridge this gap by scra** and releasing articles with known climate change misinformation.
△ Less
Submitted 30 April, 2020;
originally announced April 2020.
-
How Furiously Can Colourless Green Ideas Sleep? Sentence Acceptability in Context
Authors:
Jey Han Lau,
Carlos S. Armendariz,
Shalom Lappin,
Matthew Purver,
Chang Shu
Abstract:
We study the influence of context on sentence acceptability. First we compare the acceptability ratings of sentences judged in isolation, with a relevant context, and with an irrelevant context. Our results show that context induces a cognitive load for humans, which compresses the distribution of ratings. Moreover, in relevant contexts we observe a discourse coherence effect which uniformly raise…
▽ More
We study the influence of context on sentence acceptability. First we compare the acceptability ratings of sentences judged in isolation, with a relevant context, and with an irrelevant context. Our results show that context induces a cognitive load for humans, which compresses the distribution of ratings. Moreover, in relevant contexts we observe a discourse coherence effect which uniformly raises acceptability. Next, we test unidirectional and bidirectional language models in their ability to predict acceptability ratings. The bidirectional models show very promising results, with the best model achieving a new state-of-the-art for unsupervised acceptability prediction. The two sets of experiments provide insights into the cognitive aspects of sentence processing and central issues in the computational modelling of text and discourse.
△ Less
Submitted 2 April, 2020;
originally announced April 2020.
-
Elephant in the Room: An Evaluation Framework for Assessing Adversarial Examples in NLP
Authors:
Ying Xu,
Xu Zhong,
Antonio Jose Jimeno Yepes,
Jey Han Lau
Abstract:
An adversarial example is an input transformed by small perturbations that machine learning models consistently misclassify. While there are a number of methods proposed to generate adversarial examples for text data, it is not trivial to assess the quality of these adversarial examples, as minor perturbations (such as changing a word in a sentence) can lead to a significant shift in their meaning…
▽ More
An adversarial example is an input transformed by small perturbations that machine learning models consistently misclassify. While there are a number of methods proposed to generate adversarial examples for text data, it is not trivial to assess the quality of these adversarial examples, as minor perturbations (such as changing a word in a sentence) can lead to a significant shift in their meaning, readability and classification label. In this paper, we propose an evaluation framework consisting of a set of automatic evaluation metrics and human evaluation guidelines, to rigorously assess the quality of adversarial examples based on the aforementioned properties. We experiment with six benchmark attacking methods and found that some methods generate adversarial examples with poor readability and content preservation. We also learned that multiple factors could influence the attacking performance, such as the length of the text inputs and architecture of the classifiers.
△ Less
Submitted 1 June, 2020; v1 submitted 21 January, 2020;
originally announced January 2020.
-
Improved Document Modelling with a Neural Discourse Parser
Authors:
Fajri Koto,
Jey Han Lau,
Timothy Baldwin
Abstract:
Despite the success of attention-based neural models for natural language generation and classification tasks, they are unable to capture the discourse structure of larger documents. We hypothesize that explicit discourse representations have utility for NLP tasks over longer documents or document sequences, which sequence-to-sequence models are unable to capture. For abstractive summarization, fo…
▽ More
Despite the success of attention-based neural models for natural language generation and classification tasks, they are unable to capture the discourse structure of larger documents. We hypothesize that explicit discourse representations have utility for NLP tasks over longer documents or document sequences, which sequence-to-sequence models are unable to capture. For abstractive summarization, for instance, conventional neural models simply match source documents and the summary in a latent space without explicit representation of text structure or relations. In this paper, we propose to use neural discourse representations obtained from a rhetorical structure theory (RST) parser to enhance document representations. Specifically, document representations are generated for discourse spans, known as the elementary discourse units (EDUs). We empirically investigate the benefit of the proposed approach on two different tasks: abstractive summarization and popularity prediction of online petitions. We find that the proposed approach leads to improvements in all cases.
△ Less
Submitted 15 November, 2019;
originally announced November 2019.
-
Forget Me Not: Reducing Catastrophic Forgetting for Domain Adaptation in Reading Comprehension
Authors:
Y. Xu,
X. Zhong,
A. J. J. Yepes,
J. H. Lau
Abstract:
The creation of large-scale open domain reading comprehension data sets in recent years has enabled the development of end-to-end neural comprehension models with promising results. To use these models for domains with limited training data, one of the most effective approach is to first pretrain them on large out-of-domain source data and then fine-tune them with the limited target data. The cave…
▽ More
The creation of large-scale open domain reading comprehension data sets in recent years has enabled the development of end-to-end neural comprehension models with promising results. To use these models for domains with limited training data, one of the most effective approach is to first pretrain them on large out-of-domain source data and then fine-tune them with the limited target data. The caveat of this is that after fine-tuning the comprehension models tend to perform poorly in the source domain, a phenomenon known as catastrophic forgetting. In this paper, we explore methods that overcome catastrophic forgetting during fine-tuning without assuming access to data from the source domain. We introduce new auxiliary penalty terms and observe the best performance when a combination of auxiliary penalty terms is used to regularise the fine-tuning process for adapting comprehension models. To test our methods, we develop and release 6 narrow domain data sets that could potentially be used as reading comprehension benchmarks.
△ Less
Submitted 19 November, 2020; v1 submitted 1 November, 2019;
originally announced November 2019.
-
Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme
Authors:
Jey Han Lau,
Trevor Cohn,
Timothy Baldwin,
Julian Brooke,
Adam Hammond
Abstract:
In this paper, we propose a joint architecture that captures language, rhyme and meter for sonnet modelling. We assess the quality of generated poems using crowd and expert judgements. The stress and rhyme models perform very well, as generated poems are largely indistinguishable from human-written poems. Expert evaluation, however, reveals that a vanilla language model captures meter implicitly,…
▽ More
In this paper, we propose a joint architecture that captures language, rhyme and meter for sonnet modelling. We assess the quality of generated poems using crowd and expert judgements. The stress and rhyme models perform very well, as generated poems are largely indistinguishable from human-written poems. Expert evaluation, however, reveals that a vanilla language model captures meter implicitly, and that machine-generated poems still underperform in terms of readability and emotion. Our research shows the importance expert evaluation for poetry generation, and that future research should look beyond rhyme/meter and focus on poetic language.
△ Less
Submitted 10 July, 2018;
originally announced July 2018.
-
Document Chunking and Learning Objective Generation for Instruction Design
Authors:
Khoi-Nguyen Tran,
Jey Han Lau,
Danish Contractor,
Utkarsh Gupta,
Bikram Sengupta,
Christopher J. Butler,
Mukesh Mohania
Abstract:
Instructional Systems Design is the practice of creating of instructional experiences that make the acquisition of knowledge and skill more efficient, effective, and appealing. Specifically in designing courses, an hour of training material can require between 30 to 500 hours of effort in sourcing and organizing reference data for use in just the preparation of course material. In this paper, we p…
▽ More
Instructional Systems Design is the practice of creating of instructional experiences that make the acquisition of knowledge and skill more efficient, effective, and appealing. Specifically in designing courses, an hour of training material can require between 30 to 500 hours of effort in sourcing and organizing reference data for use in just the preparation of course material. In this paper, we present the first system of its kind that helps reduce the effort associated with sourcing reference material and course creation. We present algorithms for document chunking and automatic generation of learning objectives from content, creating descriptive content metadata to improve content-discoverability. Unlike existing methods, the learning objectives generated by our system incorporate pedagogically motivated Bloom's verbs. We demonstrate the usefulness of our methods using real world data from the banking industry and through a live deployment at a large pharmaceutical company.
△ Less
Submitted 5 August, 2018; v1 submitted 1 June, 2018;
originally announced June 2018.
-
Evaluating Word Embedding Hyper-Parameters for Similarity and Analogy Tasks
Authors:
Maryam Fanaeepour,
Adam Makarucha,
Jey Han Lau
Abstract:
The versatility of word embeddings for various applications is attracting researchers from various fields. However, the impact of hyper-parameters when training embedding model is often poorly understood. How much do hyper-parameters such as vector dimensions and corpus size affect the quality of embeddings, and how do these results translate to downstream applications? Using standard embedding ev…
▽ More
The versatility of word embeddings for various applications is attracting researchers from various fields. However, the impact of hyper-parameters when training embedding model is often poorly understood. How much do hyper-parameters such as vector dimensions and corpus size affect the quality of embeddings, and how do these results translate to downstream applications? Using standard embedding evaluation metrics and datasets, we conduct a study to empirically measure the impact of these hyper-parameters.
△ Less
Submitted 11 April, 2018;
originally announced April 2018.
-
End-to-end Network for Twitter Geolocation Prediction and Hashing
Authors:
Jey Han Lau,
Lianhua Chi,
Khoi-Nguyen Tran,
Trevor Cohn
Abstract:
We propose an end-to-end neural network to predict the geolocation of a tweet. The network takes as input a number of raw Twitter metadata such as the tweet message and associated user account information. Our model is language independent, and despite minimal feature engineering, it is interpretable and capable of learning location indicative words and timing patterns. Compared to state-of-the-ar…
▽ More
We propose an end-to-end neural network to predict the geolocation of a tweet. The network takes as input a number of raw Twitter metadata such as the tweet message and associated user account information. Our model is language independent, and despite minimal feature engineering, it is interpretable and capable of learning location indicative words and timing patterns. Compared to state-of-the-art systems, our model outperforms them by 2%-6%. Additionally, we propose extensions to the model to compress representation learnt by the network into binary codes. Experiments show that it produces compact codes compared to benchmark hashing algorithms. An implementation of the model is released publicly.
△ Less
Submitted 13 October, 2017;
originally announced October 2017.
-
An Automatic Approach for Document-level Topic Model Evaluation
Authors:
Shraey Bhatia,
Jey Han Lau,
Timothy Baldwin
Abstract:
Topic models jointly learn topics and document-level topic distribution. Extrinsic evaluation of topic models tends to focus exclusively on topic-level evaluation, e.g. by assessing the coherence of topics. We demonstrate that there can be large discrepancies between topic- and document-level model quality, and that basing model evaluation on topic-level analysis can be highly misleading. We propo…
▽ More
Topic models jointly learn topics and document-level topic distribution. Extrinsic evaluation of topic models tends to focus exclusively on topic-level evaluation, e.g. by assessing the coherence of topics. We demonstrate that there can be large discrepancies between topic- and document-level model quality, and that basing model evaluation on topic-level analysis can be highly misleading. We propose a method for automatically predicting topic model quality based on analysis of document-level topic allocations, and provide empirical evidence for its robustness.
△ Less
Submitted 15 June, 2017;
originally announced June 2017.
-
Topically Driven Neural Language Model
Authors:
Jey Han Lau,
Timothy Baldwin,
Trevor Cohn
Abstract:
Language models are typically applied at the sentence level, without access to the broader document context. We present a neural language model that incorporates document context in the form of a topic model-like architecture, thus providing a succinct representation of the broader document context outside of the current sentence. Experiments over a range of datasets demonstrate that our model out…
▽ More
Language models are typically applied at the sentence level, without access to the broader document context. We present a neural language model that incorporates document context in the form of a topic model-like architecture, thus providing a succinct representation of the broader document context outside of the current sentence. Experiments over a range of datasets demonstrate that our model outperforms a pure sentence-based model in terms of language model perplexity, and leads to topics that are potentially more coherent than those produced by a standard LDA topic model. Our model also has the ability to generate related sentences for a topic, providing another way to interpret topics.
△ Less
Submitted 2 May, 2017; v1 submitted 26 April, 2017;
originally announced April 2017.