Showing 1–50 of 52 results for author: Lau, J H

Searching in archive cs.
  1. arXiv:2406.14709  [pdf, other]

    cs.CL

    Factual Dialogue Summarization via Learning from Large Language Models

    Authors: Rongxin Zhu, Jey Han Lau, Jianzhong Qi

    Abstract: Factual consistency is an important quality in dialogue summarization. Large language model (LLM)-based automatic text summarization models generate more factually consistent summaries compared to those by smaller pretrained language models, but they face deployment challenges in real-world applications due to privacy or resource constraints. In this paper, we investigate the use of symbolic knowl…

    Submitted 20 June, 2024; originally announced June 2024.

    ACM Class: F.2.2; I.2.7

  2. arXiv:2406.12645  [pdf, other]

    cs.CL cs.AI

    Evaluating Transparency of Machine Generated Fact Checking Explanations

    Authors: Rui Xing, Timothy Baldwin, Jey Han Lau

    Abstract: An important factor when it comes to generating fact-checking explanations is the selection of evidence: intuitively, high-quality explanations can only be generated given the right evidence. In this work, we investigate the impact of human-curated vs. machine-selected evidence for explanation generation using large language models. To assess the quality of explanations, we focus on transparency (…

    Submitted 18 June, 2024; originally announced June 2024.

  3. arXiv:2402.18005  [pdf, other]

    cs.CL cs.AI

    A Sentiment Consolidation Framework for Meta-Review Generation

    Authors: Miao Li, Jey Han Lau, Eduard Hovy

    Abstract: Modern natural language generation systems with Large Language Models (LLMs) exhibit the capability to generate a plausible summary of multiple documents; however, it is uncertain if they truly possess the capability of information consolidation to generate summaries, especially on documents with opinionated information. We focus on meta-review generation, a form of sentiment summarisation for the…

    Submitted 4 June, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

    Comments: Long paper, ACL 2024 Main

  4. arXiv:2402.08155  [pdf, other]

    cs.CL cs.AI

    CMA-R: Causal Mediation Analysis for Explaining Rumour Detection

    Authors: Lin Tian, Xiuzhen Zhang, Jey Han Lau

    Abstract: We apply causal mediation analysis to explain the decision-making process of neural models for rumour detection on Twitter. Interventions at the input and network level reveal the causal impacts of tweets and words in the model output. We find that our approach CMA-R -- Causal Mediation Analysis for Rumour detection -- identifies salient tweets that explain model predictions and show strong agreem…

    Submitted 12 February, 2024; originally announced February 2024.

    Comments: 9 pages, 7 figures, Accepted by EACL 2024 Findings

  5. arXiv:2311.00310  [pdf, other]

    cs.CL cs.AI

    Unsupervised Lexical Simplification with Context Augmentation

    Authors: Takashi Wada, Timothy Baldwin, Jey Han Lau

    Abstract: We propose a new unsupervised lexical simplification method that uses only monolingual data and pre-trained language models. Given a target word and its context, our method generates substitutes based on the target context and also additional contexts sampled from monolingual data. We conduct experiments in English, Portuguese, and Spanish on the TSAR-2022 shared task, and show that our model subs…

    Submitted 1 November, 2023; originally announced November 2023.

    Comments: 12 pages; accepted for the Findings of EMNLP 2023

  6. arXiv:2306.01443  [pdf, other]

    cs.CL cs.AI cs.LG

    Unsupervised Paraphrasing of Multiword Expressions

    Authors: Takashi Wada, Yuji Matsumoto, Timothy Baldwin, Jey Han Lau

    Abstract: We propose an unsupervised approach to paraphrasing multiword expressions (MWEs) in context. Our model employs only monolingual corpus data and pre-trained language models (without fine-tuning), and does not make use of any external resources such as dictionaries. We evaluate our method on the SemEval 2022 idiomatic semantic text similarity task, and show that it outperforms all unsupervised syste…

    Submitted 2 June, 2023; originally announced June 2023.

    Comments: 13 pages; accepted for Findings of ACL 2023

  7. arXiv:2305.16548  [pdf, other]

    cs.CL cs.AI

    Annotating and Detecting Fine-grained Factual Errors for Dialogue Summarization

    Authors: Rongxin Zhu, Jianzhong Qi, Jey Han Lau

    Abstract: A series of datasets and models have been proposed for summaries generated for well-formatted documents such as news articles. Dialogue summaries, however, have been under explored. In this paper, we present the first dataset with fine-grained factual error annotations named DIASUMFACT. We define fine-grained factual error detection as a sentence-level multi-label classification problem, and we ev…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: Accepted in ACL 2023

  8. arXiv:2305.01498  [pdf, other]

    cs.CL cs.AI

    Summarizing Multiple Documents with Conversational Structure for Meta-Review Generation

    Authors: Miao Li, Eduard Hovy, Jey Han Lau

    Abstract: We present PeerSum, a novel dataset for generating meta-reviews of scientific papers. The meta-reviews can be interpreted as abstractive summaries of reviews, multi-turn discussions and the paper abstract. These source documents have rich inter-document relationships with an explicit hierarchical conversational structure, cross-references and (occasionally) conflicting information. To introduce th…

    Submitted 23 October, 2023; v1 submitted 2 May, 2023; originally announced May 2023.

    Comments: Long paper; Accepted to EMNLP 2023; Soundness: 3, 3, 4; Excitement: 3, 4, 4

  9. arXiv:2303.08991  [pdf, other]

    cs.CL

    DeltaScore: Fine-Grained Story Evaluation with Perturbations

    Authors: Zhuohan Xie, Miao Li, Trevor Cohn, Jey Han Lau

    Abstract: Numerous evaluation metrics have been developed for natural language generation tasks, but their effectiveness in evaluating stories is limited as they are not specifically tailored to assess intricate aspects of storytelling, such as fluency and interestingness. In this paper, we introduce DELTASCORE, a novel methodology that employs perturbation techniques for the evaluation of nuanced story asp…

    Submitted 2 November, 2023; v1 submitted 15 March, 2023; originally announced March 2023.

    Comments: 15 pages, 3 figures, 8 tables. Camera ready version for EMNLP 2023 findings

  10. MetaTroll: Few-shot Detection of State-Sponsored Trolls with Transformer Adapters

    Authors: Lin Tian, Xiuzhen Zhang, Jey Han Lau

    Abstract: State-sponsored trolls are the main actors of influence campaigns on social media and automatic troll detection is important to combat misinformation at scale. Existing troll detection models are developed based on training data for known campaigns (e.g. the influence campaign by Russia's Internet Research Agency on the 2016 US Election), and they fall short when dealing with novel campaign…

    Submitted 13 March, 2023; originally announced March 2023.

    Comments: 11 pages, 2 figures, Accepted by the Web Conference 2023 (WWW 2023)

  11. arXiv:2303.06565  [pdf, other]

    cs.CL cs.AI

    Compressed Heterogeneous Graph for Abstractive Multi-Document Summarization

    Authors: Miao Li, Jianzhong Qi, Jey Han Lau

    Abstract: Multi-document summarization (MDS) aims to generate a summary for a number of related documents. We propose HGSUM, an MDS model that extends an encoder-decoder architecture, to incorporate a heterogeneous graph to represent different semantic units (e.g., words and sentences) of the documents. This contrasts with existing MDS models which do not consider different edge types of graphs and as such…

    Submitted 11 March, 2023; originally announced March 2023.

    Comments: AAAI 2023

  12. arXiv:2301.09790  [pdf, other]

    cs.CL

    The Next Chapter: A Study of Large Language Models in Storytelling

    Authors: Zhuohan Xie, Trevor Cohn, Jey Han Lau

    Abstract: To enhance the quality of generated stories, recent story generation models have been investigating the utilization of higher-level attributes like plots or commonsense knowledge. The application of prompt-based learning with large language models (LLMs), exemplified by GPT-3, has exhibited remarkable performance in diverse natural language processing (NLP) tasks. This paper conducts a comprehensi…

    Submitted 24 July, 2023; v1 submitted 23 January, 2023; originally announced January 2023.

    Comments: Accepted to INLG2023

  13. arXiv:2210.03256  [pdf, other]

    cs.CL

    Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation

    Authors: Thinh Hung Truong, Yulia Otmakhova, Timothy Baldwin, Trevor Cohn, Jey Han Lau, Karin Verspoor

    Abstract: Negation is poorly captured by current language models, although the extent of this problem is not widely understood. We introduce a natural language inference (NLI) test suite to enable probing the capabilities of NLP methods, with the aim of understanding sub-clausal negation. The test suite contains premise--hypothesis pairs where the premise contains sub-clausal negation and the hypothesis is…

    Submitted 13 October, 2022; v1 submitted 6 October, 2022; originally announced October 2022.

    Comments: AACL-IJCNLP 2022

  14. arXiv:2210.02206  [pdf, other]

    cs.MM

    Improving Visual-Semantic Embedding with Adaptive Pooling and Optimization Objective

    Authors: Zijian Zhang, Chang Shu, Ya Xiao, Yuan Shen, Di Zhu, Jing Xiao, Youxin Chen, Jey Han Lau, Qian Zhang, Zheng Lu

    Abstract: Visual-Semantic Embedding (VSE) aims to learn an embedding space where related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use hard triplet loss for optimization. However, we find that: (1) combining simple pooling methods is no worse than these sophisticated methods; a…

    Submitted 5 October, 2022; originally announced October 2022.

  15. arXiv:2209.08698  [pdf, other]

    cs.CL

    LED down the rabbit hole: exploring the potential of global attention for biomedical multi-document summarisation

    Authors: Yulia Otmakhova, Hung Thinh Truong, Timothy Baldwin, Trevor Cohn, Karin Verspoor, Jey Han Lau

    Abstract: In this paper we report on our submission to the Multidocument Summarisation for Literature Review (MSLR) shared task. Specifically, we adapt PRIMERA (Xiao et al., 2022) to the biomedical domain by placing global attention on important biomedical entities in several ways. We analyse the outputs of the 23 resulting models, and report patterns in the results related to the presence of additional glo…

    Submitted 18 September, 2022; originally announced September 2022.

    Comments: SDP Workshop at COLING 2022

  16. arXiv:2209.08236  [pdf, other]

    cs.CL cs.AI

    Unsupervised Lexical Substitution with Decontextualised Embeddings

    Authors: Takashi Wada, Timothy Baldwin, Yuji Matsumoto, Jey Han Lau

    Abstract: We propose a new unsupervised method for lexical substitution using pre-trained language models. Compared to previous approaches that use the generative capability of language models to predict substitutes, our method retrieves substitutes based on the similarity of contextualised and decontextualised word embeddings, i.e. the average contextual representation of a word in multiple contexts. We co…

    Submitted 16 September, 2022; originally announced September 2022.

    Comments: 14 pages, accepted for COLING 2022

  17. arXiv:2205.15960  [pdf, other]

    cs.CL

    NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

    Authors: Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Rahmad Mahendra, Fajri Koto, Ade Romadhony, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Pascale Fung, Timothy Baldwin, Jey Han Lau, Rico Sennrich, Sebastian Ruder

    Abstract: Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing re…

    Submitted 12 April, 2023; v1 submitted 31 May, 2022; originally announced May 2022.

    Comments: EACL 2023

  18. arXiv:2205.10363  [pdf, other]

    cs.CL

    Robust Task-Oriented Dialogue Generation with Contrastive Pre-training and Adversarial Filtering

    Authors: Shiquan Yang, Xinting Huang, Jey Han Lau, Sarah Erfani

    Abstract: Data artifacts incentivize machine learning models to learn non-transferable generalizations by taking advantage of shortcuts in the data, and there is growing evidence that data artifacts play a role for the strong results that deep learning models achieve in recent natural language processing benchmarks. In this paper, we focus on task-oriented dialogue and investigate whether popular datasets s…

    Submitted 19 May, 2022; originally announced May 2022.

  19. arXiv:2203.13357  [pdf, other]

    cs.CL

    One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

    Authors: Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya, Ade Romadhony, Rahmad Mahendra, Kemal Kurniawan, David Moeljadi, Radityo Eko Prasojo, Timothy Baldwin, Jey Han Lau, Sebastian Ruder

    Abstract: NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian N…

    Submitted 24 March, 2022; originally announced March 2022.

    Comments: Accepted in ACL 2022

  20. arXiv:2203.05843  [pdf, other]

    cs.CL

    An Interpretable Neuro-Symbolic Reasoning Framework for Task-Oriented Dialogue Generation

    Authors: Shiquan Yang, Rui Zhang, Sarah Erfani, Jey Han Lau

    Abstract: We study the interpretability issue of task-oriented dialogue systems in this paper. Previously, most neural-based task-oriented dialogue systems employ an implicit reasoning strategy that makes the model predictions uninterpretable to humans. To obtain a transparent reasoning process, we introduce neuro-symbolic to perform explicit reasoning that justifies model decisions by reasoning chains. Sin…

    Submitted 11 March, 2022; originally announced March 2022.

  21. arXiv:2203.01769   

    cs.IR cs.CL

    PeerSum: A Peer Review Dataset for Abstractive Multi-document Summarization

    Authors: Miao Li, Jianzhong Qi, Jey Han Lau

    Abstract: We present PeerSum, a new MDS dataset using peer reviews of scientific publications. Our dataset differs from the existing MDS datasets in that our summaries (i.e., the meta-reviews) are highly abstractive and they are real summaries of the source documents (i.e., the reviews) and it also features disagreements among source documents. We found that current state-of-the-art MDS models struggle to g…

    Submitted 28 September, 2022; v1 submitted 3 March, 2022; originally announced March 2022.

    Comments: This is because the paper has changed so much and the arxiv paper no longer represents the PeerSum

  22. arXiv:2202.07858  [pdf, ps, other]

    cs.CL cs.IR

    ITTC @ TREC 2021 Clinical Trials Track

    Authors: Thinh Hung Truong, Yulia Otmakhova, Rahmad Mahendra, Timothy Baldwin, Jey Han Lau, Trevor Cohn, Lawrence Cavedon, Damiano Spina, Karin Verspoor

    Abstract: This paper describes the submissions of the Natural Language Processing (NLP) team from the Australian Research Council Industrial Transformation Training Centre (ITTC) for Cognitive Computing in Medical Technologies to the TREC 2021 Clinical Trials Track. The task focuses on the problem of matching eligible clinical trials to topics constituting a summary of a patient's admission notes. We explor…

    Submitted 15 February, 2022; originally announced February 2022.

    Comments: 7 pages

  23. arXiv:2112.05346  [pdf, other]

    cs.CL

    Findings on Conversation Disentanglement

    Authors: Rongxin Zhu, Jey Han Lau, Jianzhong Qi

    Abstract: Conversation disentanglement, the task to identify separate threads in conversations, is an important pre-processing step in multi-party conversational NLP applications such as conversational question answering and conversation summarization. Framing it as an utterance-to-utterance classification problem -- i.e. given an utterance of interest (UOI), find which past utterance it replies to -- we exp…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: accepted in ALTA 2021

  24. arXiv:2111.08133  [pdf, other]

    cs.CL cs.LG

    Exploring Story Generation with Multi-task Objectives in Variational Autoencoders

    Authors: Zhuohan Xie, Trevor Cohn, Jey Han Lau

    Abstract: GPT-2 has been frequently adapted in story generation models as it provides powerful generative capability. However, it still fails to generate consistent stories and lacks diversity. Current story generation models leverage additional information such as plots or commonsense into GPT-2 to guide the generation process. These approaches focus on improving generation quality of stories while our wor…

    Submitted 15 November, 2021; originally announced November 2021.

    Comments: 10 pages, 3 figures, ALTA2021

  25. arXiv:2109.12773  [pdf, other]

    cs.CL

    Rumour Detection via Zero-shot Cross-lingual Transfer Learning

    Authors: Lin Tian, Xiuzhen Zhang, Jey Han Lau

    Abstract: Most rumour detection models for social media are designed for one specific language (mostly English). There are over 40 languages on Twitter and most languages lack annotated resources to build rumour detection models. In this paper we propose a zero-shot cross-lingual transfer learning framework that can adapt a rumour detection model trained for a source language to another target language. Our…

    Submitted 26 September, 2021; originally announced September 2021.

    Comments: ECML-PKDD 2021

  26. arXiv:2109.04607  [pdf, other]

    cs.CL

    IndoBERTweet: A Pretrained Language Model for Indonesian Twitter with Effective Domain-Specific Vocabulary Initialization

    Authors: Fajri Koto, Jey Han Lau, Timothy Baldwin

    Abstract: We present IndoBERTweet, the first large-scale pretrained model for Indonesian Twitter that is trained by extending a monolingually-trained Indonesian BERT model with additive domain-specific vocabulary. We focus in particular on efficient model adaptation under vocabulary mismatch, and benchmark different ways of initializing the BERT embedding layer for new word types. We find that initializing…

    Submitted 9 September, 2021; originally announced September 2021.

    Comments: Accepted at EMNLP 2021

  27. arXiv:2107.14740  [pdf, other]

    cs.CL

    Automatic Claim Review for Climate Science via Explanation Generation

    Authors: Shraey Bhatia, Jey Han Lau, Timothy Baldwin

    Abstract: There is unison in the scientific community about human induced climate change. Despite this, we see the web awash with claims around climate change scepticism, thus driving the need for fact checking them but at the same time providing an explanation and justification for the fact check. Scientists and experts have been trying to address it by providing manually written feedback for these claims.…

    Submitted 30 July, 2021; originally announced July 2021.

  28. arXiv:2106.01478  [pdf, other]

    cs.CL

    Evaluating the Efficacy of Summarization Evaluation across Languages

    Authors: Fajri Koto, Jey Han Lau, Timothy Baldwin

    Abstract: While automatic summarization evaluation methods developed for English are routinely applied to other languages, this is the first attempt to systematically quantify their panlinguistic efficacy. We take a summarization corpus for eight different languages, and manually annotate generated summaries for focus (precision) and coverage (recall). Based on this, we evaluate 19 summarization evaluation…

    Submitted 2 June, 2021; originally announced June 2021.

    Comments: Findings of ACL 2021

  29. arXiv:2105.12261  [pdf, other]

    cs.CL cs.IR

    Impact of detecting clinical trial elements in exploration of COVID-19 literature

    Authors: Simon Šuster, Karin Verspoor, Timothy Baldwin, Jey Han Lau, Antonio Jimeno Yepes, David Martinez, Yulia Otmakhova

    Abstract: The COVID-19 pandemic has driven ever-greater demand for tools which enable efficient exploration of biomedical literature. Although semi-structured information resulting from concept recognition and detection of the defining elements of clinical trials (e.g. PICO criteria) has been commonly used to support literature search, the contributions of this abstraction remain poorly understood, especial…

    Submitted 25 May, 2021; originally announced May 2021.

    Comments: Accepted at HealthNLP'21

  30. arXiv:2104.05882  [pdf, other]

    cs.CL

    Discourse Probing of Pretrained Language Models

    Authors: Fajri Koto, Jey Han Lau, Timothy Baldwin

    Abstract: Existing work on probing of pretrained language models (LMs) has predominantly focused on sentence-level syntactic tasks. In this paper, we introduce document-level discourse probing to evaluate the ability of pretrained LMs to capture document-level relations. We experiment with 7 pretrained LMs, 4 languages, and 7 discourse probing tasks, and find BART to be overall the best model at capturing d…

    Submitted 12 April, 2021; originally announced April 2021.

    Comments: Accepted at NAACL 2021

  31. arXiv:2103.11576  [pdf, other]

    cs.LG cs.AI cs.CL

    Grey-box Adversarial Attack And Defence For Sentiment Classification

    Authors: Ying Xu, Xu Zhong, Antonio Jimeno Yepes, Jey Han Lau

    Abstract: We introduce a grey-box adversarial attack and defence framework for sentiment classification. We address the issues of differentiability, label preservation and input reconstruction for adversarial attack and defence in one unified framework. Our results show that once trained, the attacking model is capable of generating high-quality adversarial examples substantially faster (one order of magnit…

    Submitted 22 March, 2021; originally announced March 2021.

  32. arXiv:2102.02080  [pdf, other]

    cs.CL

    Top-down Discourse Parsing via Sequence Labelling

    Authors: Fajri Koto, Jey Han Lau, Timothy Baldwin

    Abstract: We introduce a top-down approach to discourse parsing that is conceptually simpler than its predecessors (Kobayashi et al., 2020; Zhang et al., 2020). By framing the task as a sequence labelling problem where the goal is to iteratively segment a document into individual discourse units, we are able to eliminate the decoder and reduce the search space for splitting points. We explore both tradition…

    Submitted 5 April, 2021; v1 submitted 3 February, 2021; originally announced February 2021.

    Comments: Accepted at EACL 2021

  33. arXiv:2011.13662  [pdf, other]

    cs.CL

    FFCI: A Framework for Interpretable Automatic Evaluation of Summarization

    Authors: Fajri Koto, Timothy Baldwin, Jey Han Lau

    Abstract: In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary content relative to the reference), coverage (recall of summary content relative to the reference), and inter-sentential coherence (document fluency between adjacent sentences). We construct a n…

    Submitted 27 February, 2022; v1 submitted 27 November, 2020; originally announced November 2020.

    Comments: Accepted at Journal of Artificial Intelligence Research (JAIR 2022)

  34. arXiv:2011.00679  [pdf, other]

    cs.CL

    Liputan6: A Large-scale Indonesian Dataset for Text Summarization

    Authors: Fajri Koto, Jey Han Lau, Timothy Baldwin

    Abstract: In this paper, we introduce a large-scale Indonesian summarization dataset. We harvest articles from Liputan6.com, an online news portal, and obtain 215,827 document-summary pairs. We leverage pre-trained language models to develop benchmark extractive and abstractive summarization methods over the dataset with multilingual and monolingual BERT-based models. We include a thorough error analysis by…

    Submitted 1 November, 2020; originally announced November 2020.

    Comments: Accepted at AACL-IJCNLP 2020

  35. arXiv:2011.00677  [pdf, other]

    cs.CL

    IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

    Authors: Fajri Koto, Afshin Rahimi, Jey Han Lau, Timothy Baldwin

    Abstract: Although the Indonesian language is spoken by almost 200 million people and the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian…

    Submitted 1 November, 2020; originally announced November 2020.

    Comments: Accepted at COLING 2020 - The 28th International Conference on Computational Linguistics

  36. arXiv:2010.14649  [pdf]

    cs.CL cs.AI cs.LG

    Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

    Authors: Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, Jey Han Lau

    Abstract: We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence. Through sharing model parameters among different languages, our model jointly trains the word embeddings in a…

    Submitted 19 October, 2021; v1 submitted 27 October, 2020; originally announced October 2020.

    Comments: 16 pages, accepted at the 1st Workshop on Multilingual Representation Learning

  37. arXiv:2008.07880  [pdf, other]

    cs.CL cs.IR

    COVID-SEE: Scientific Evidence Explorer for COVID-19 Related Research

    Authors: Karin Verspoor, Simon Šuster, Yulia Otmakhova, Shevon Mendis, Zenan Zhai, Biaoyan Fang, Jey Han Lau, Timothy Baldwin, Antonio Jimeno Yepes, David Martinez

    Abstract: We present COVID-SEE, a system for medical literature discovery based on the concept of information exploration, which builds on several distinct text analysis and natural language processing methods to structure and organise information in publications, and augments search by providing a visual overview supporting exploration of a collection to identify key articles of interest. We developed this…

    Submitted 18 August, 2020; originally announced August 2020.

    Comments: COVID-SEE is available at http://covid-see.com

  38. arXiv:2007.04526  [pdf, other]

    cs.CL cs.IR cs.LG

    Less is More: Rejecting Unreliable Reviews for Product Question Answering

    Authors: Shiwei Zhang, Xiuzhen Zhang, Jey Han Lau, Jeffrey Chan, Cecile Paris

    Abstract: Promptly and accurately answering questions on products is important for e-commerce applications. Manually answering product questions (e.g. on community question answering platforms) results in slow response and does not scale. Recent studies show that product reviews are a good source for real-time, automatic product question answering (PQA). In the literature, PQA is formulated as a retrieval p…

    Submitted 8 July, 2020; originally announced July 2020.

    Comments: ECML-PKDD 2020

    ACM Class: I.2.7

  39. arXiv:2005.13213  [pdf, other]

    cs.CL

    Give Me Convenience and Give Her Death: Who Should Decide What Uses of NLP are Appropriate, and on What Basis?

    Authors: Kobi Leins, Jey Han Lau, Timothy Baldwin

    Abstract: As part of growing NLP capabilities, coupled with an awareness of the ethical dimensions of research, questions have been raised about whether particular datasets and tasks should be deemed off-limits for NLP research. We examine this question with respect to a paper on automatic legal sentencing from EMNLP 2019 which was a source of some debate, in asking whether the paper should have been allowe…

    Submitted 27 May, 2020; originally announced May 2020.

    Comments: 6 pages; accepted for ACL2020

  40. arXiv:2004.14907  [pdf, ps, other]

    cs.CL

    You are right. I am ALARMED -- But by Climate Change Counter Movement

    Authors: Shraey Bhatia, Jey Han Lau, Timothy Baldwin

    Abstract: The world is facing the challenge of climate crisis. Despite the consensus in scientific community about anthropogenic global warming, the web is flooded with articles spreading climate misinformation. These articles are carefully constructed by climate change counter movement (cccm) organizations to influence the narrative around climate change. We revisit the literature on climate misinformation…

    Submitted 30 April, 2020; originally announced April 2020.

    Comments: 5 pages

  41. arXiv:2004.00881  [pdf, other]

    cs.CL

    How Furiously Can Colourless Green Ideas Sleep? Sentence Acceptability in Context

    Authors: Jey Han Lau, Carlos S. Armendariz, Shalom Lappin, Matthew Purver, Chang Shu

    Abstract: We study the influence of context on sentence acceptability. First we compare the acceptability ratings of sentences judged in isolation, with a relevant context, and with an irrelevant context. Our results show that context induces a cognitive load for humans, which compresses the distribution of ratings. Moreover, in relevant contexts we observe a discourse coherence effect which uniformly raise…

    Submitted 2 April, 2020; originally announced April 2020.

    Comments: 14 pages. Author's final version, accepted for publication in Transactions of the Association for Computational Linguistics

    ACM Class: I.2.7

  42. arXiv:2001.07820  [pdf, other]

    cs.CL cs.LG

    Elephant in the Room: An Evaluation Framework for Assessing Adversarial Examples in NLP

    Authors: Ying Xu, Xu Zhong, Antonio Jose Jimeno Yepes, Jey Han Lau

    Abstract: An adversarial example is an input transformed by small perturbations that machine learning models consistently misclassify. While there are a number of methods proposed to generate adversarial examples for text data, it is not trivial to assess the quality of these adversarial examples, as minor perturbations (such as changing a word in a sentence) can lead to a significant shift in their meaning…

    Submitted 1 June, 2020; v1 submitted 21 January, 2020; originally announced January 2020.

  43. arXiv:1911.06919  [pdf, other]

    cs.CL cs.LG

    Improved Document Modelling with a Neural Discourse Parser

    Authors: Fajri Koto, Jey Han Lau, Timothy Baldwin

    Abstract: Despite the success of attention-based neural models for natural language generation and classification tasks, they are unable to capture the discourse structure of larger documents. We hypothesize that explicit discourse representations have utility for NLP tasks over longer documents or document sequences, which sequence-to-sequence models are unable to capture. For abstractive summarization, fo…

    Submitted 15 November, 2019; originally announced November 2019.

  44. arXiv:1911.00202  [pdf, other]

    cs.CL cs.LG

    Forget Me Not: Reducing Catastrophic Forgetting for Domain Adaptation in Reading Comprehension

    Authors: Y. Xu, X. Zhong, A. J. J. Yepes, J. H. Lau

    Abstract: The creation of large-scale open domain reading comprehension data sets in recent years has enabled the development of end-to-end neural comprehension models with promising results. To use these models for domains with limited training data, one of the most effective approaches is to first pretrain them on large out-of-domain source data and then fine-tune them with the limited target data. The cave…

    Submitted 19 November, 2020; v1 submitted 1 November, 2019; originally announced November 2019.

  45. arXiv:1807.03491  [pdf, other]

    cs.CL

    Deep-speare: A Joint Neural Model of Poetic Language, Meter and Rhyme

    Authors: Jey Han Lau, Trevor Cohn, Timothy Baldwin, Julian Brooke, Adam Hammond

    Abstract: In this paper, we propose a joint architecture that captures language, rhyme and meter for sonnet modelling. We assess the quality of generated poems using crowd and expert judgements. The stress and rhyme models perform very well, as generated poems are largely indistinguishable from human-written poems. Expert evaluation, however, reveals that a vanilla language model captures meter implicitly,…

    Submitted 10 July, 2018; originally announced July 2018.

    Comments: 11 pages; ACL2018

    Journal ref: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL 2018)

  46. arXiv:1806.01351  [pdf, other]

    cs.CL cs.CY cs.IR

    Document Chunking and Learning Objective Generation for Instruction Design

    Authors: Khoi-Nguyen Tran, Jey Han Lau, Danish Contractor, Utkarsh Gupta, Bikram Sengupta, Christopher J. Butler, Mukesh Mohania

    Abstract: Instructional Systems Design is the practice of creating instructional experiences that make the acquisition of knowledge and skill more efficient, effective, and appealing. Specifically in designing courses, an hour of training material can require between 30 and 500 hours of effort in sourcing and organizing reference data for use in just the preparation of course material. In this paper, we p…

    Submitted 5 August, 2018; v1 submitted 1 June, 2018; originally announced June 2018.

    Comments: Proceedings of the 11th International Conference on Education Data Mining (EDM 2018)

  47. arXiv:1804.04211  [pdf, other]

    cs.CL

    Evaluating Word Embedding Hyper-Parameters for Similarity and Analogy Tasks

    Authors: Maryam Fanaeepour, Adam Makarucha, Jey Han Lau

    Abstract: The versatility of word embeddings for various applications is attracting researchers from various fields. However, the impact of hyper-parameters when training embedding models is often poorly understood. How much do hyper-parameters such as vector dimensions and corpus size affect the quality of embeddings, and how do these results translate to downstream applications? Using standard embedding ev…

    Submitted 11 April, 2018; originally announced April 2018.

    Comments: 8 pages, 16 figures

  48. arXiv:1710.04802  [pdf, other]

    cs.CL

    End-to-end Network for Twitter Geolocation Prediction and Hashing

    Authors: Jey Han Lau, Lianhua Chi, Khoi-Nguyen Tran, Trevor Cohn

    Abstract: We propose an end-to-end neural network to predict the geolocation of a tweet. The network takes as input raw Twitter metadata such as the tweet message and associated user account information. Our model is language independent, and despite minimal feature engineering, it is interpretable and capable of learning location indicative words and timing patterns. Compared to state-of-the-ar…

    Submitted 13 October, 2017; originally announced October 2017.

    Comments: 10 pages, in Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017) (to appear)

  49. arXiv:1706.05140  [pdf, other]

    cs.CL

    An Automatic Approach for Document-level Topic Model Evaluation

    Authors: Shraey Bhatia, Jey Han Lau, Timothy Baldwin

    Abstract: Topic models jointly learn topics and document-level topic distribution. Extrinsic evaluation of topic models tends to focus exclusively on topic-level evaluation, e.g. by assessing the coherence of topics. We demonstrate that there can be large discrepancies between topic- and document-level model quality, and that basing model evaluation on topic-level analysis can be highly misleading. We propo…

    Submitted 15 June, 2017; originally announced June 2017.

    Comments: 10 pages; accepted for the Twenty First Conference on Computational Natural Language Learning (CoNLL 2017)

  50. arXiv:1704.08012  [pdf, other]

    cs.CL

    Topically Driven Neural Language Model

    Authors: Jey Han Lau, Timothy Baldwin, Trevor Cohn

    Abstract: Language models are typically applied at the sentence level, without access to the broader document context. We present a neural language model that incorporates document context in the form of a topic model-like architecture, thus providing a succinct representation of the broader document context outside of the current sentence. Experiments over a range of datasets demonstrate that our model out…

    Submitted 2 May, 2017; v1 submitted 26 April, 2017; originally announced April 2017.

    Comments: 11 pages, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017) (to appear)

    Journal ref: In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), pp. 355--365