Showing 1–44 of 44 results for author: Derczynski, L

  1. arXiv:2406.11704  [pdf, other]

    cs.CL cs.AI cs.LG

    Nemotron-4 340B Technical Report

    Authors: Nvidia, Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, et al. (58 additional authors not shown)

    Abstract: We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation be…

    Submitted 17 June, 2024; originally announced June 2024.

  2. arXiv:2406.11036  [pdf, other]

    cs.CL cs.CR

    garak: A Framework for Security Probing Large Language Models

    Authors: Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, Nanna Inie

    Abstract: As Large Language Models (LLMs) are deployed and integrated into thousands of applications, the need for scalable evaluation of how models respond to adversarial attacks grows rapidly. However, LLM security is a moving target: models produce unpredictable output, are constantly updated, and the potential adversary is highly diverse: anyone with access to the internet and a decent command of natura…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: https://garak.ai

  3. arXiv:2404.12241  [pdf, other]

    cs.CL cs.AI

    Introducing v0.5 of the AI Safety Benchmark from MLCommons

    Authors: Bertie Vidgen, Adarsh Agrawal, Ahmed M. Ahmed, Victor Akinwande, Namir Al-Nuaimi, Najla Alfaraj, Elie Alhajjar, Lora Aroyo, Trupti Bavalatti, Max Bartolo, Borhane Blili-Hamelin, Kurt Bollacker, Rishi Bomassani, Marisa Ferrara Boston, Siméon Campos, Kal Chakra, Canyu Chen, Cody Coleman, Zacharie Delpierre Coudert, Leon Derczynski, Debojyoti Dutta, Ian Eisenberg, James Ezick, Heather Frase, Brian Fuller, et al. (75 additional authors not shown)

    Abstract: This paper introduces v0.5 of the AI Safety Benchmark, which has been created by the MLCommons AI Safety Working Group. The AI Safety Benchmark has been designed to assess the safety risks of AI systems that use chat-tuned language models. We introduce a principled approach to specifying and constructing the benchmark, which for v0.5 covers only a single use case (an adult chatting to a general-pu…

    Submitted 13 May, 2024; v1 submitted 18 April, 2024; originally announced April 2024.

  4. arXiv:2311.06237  [pdf, other]

    cs.CL cs.CR cs.HC

    Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild

    Authors: Nanna Inie, Jonathan Stray, Leon Derczynski

    Abstract: Engaging in the deliberate generation of abnormal outputs from large language models (LLMs) by attacking them is a novel human activity. This paper presents a thorough exposition of how and why people perform such attacks. Using a formal qualitative methodology, we interviewed dozens of practitioners from a broad range of backgrounds, all contributors to this novel work of attempting to cause LLMs…

    Submitted 13 November, 2023; v1 submitted 10 November, 2023; originally announced November 2023.

  5. arXiv:2306.16900  [pdf, other]

    cs.CL

    Surveying (Dis)Parities and Concerns of Compute Hungry NLP Research

    Authors: Ji-Ung Lee, Haritz Puerto, Betty van Aken, Yuki Arase, Jessica Zosa Forde, Leon Derczynski, Andreas Rücklé, Iryna Gurevych, Roy Schwartz, Emma Strubell, Jesse Dodge

    Abstract: Many recent improvements in NLP stem from the development and use of large pre-trained language models (PLMs) with billions of parameters. Large model sizes make computational cost one of the main limiting factors for training and evaluating such models, and have raised severe concerns about the sustainability, reproducibility, and inclusiveness of research on PLMs. These concerns are often based…

    Submitted 9 November, 2023; v1 submitted 29 June, 2023; originally announced June 2023.

  6. arXiv:2303.18190  [pdf, other]

    cs.CL

    Assessing Language Model Deployment with Risk Cards

    Authors: Leon Derczynski, Hannah Rose Kirk, Vidhisha Balachandran, Sachin Kumar, Yulia Tsvetkov, M. R. Leiser, Saif Mohammad

    Abstract: This paper introduces RiskCards, a framework for structured assessment and documentation of risks associated with an application of language models. As with all language, text generated by language models can be harmful, or used to bring about harm. Automating language generation adds both an element of scale and also more subtle or emergent undesirable tendencies to the generated text. Prior work…

    Submitted 31 March, 2023; originally announced March 2023.

  7. arXiv:2209.00099  [pdf, other]

    cs.CL

    Efficient Methods for Natural Language Processing: A Survey

    Authors: Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R. Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, Pedro H. Martins, André F. T. Martins, Jessica Zosa Forde, Peter Milder, Edwin Simpson, Noam Slonim, Jesse Dodge, Emma Strubell, Niranjan Balasubramanian, Leon Derczynski, Iryna Gurevych, Roy Schwartz

    Abstract: Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require few…

    Submitted 24 March, 2023; v1 submitted 31 August, 2022; originally announced September 2022.

    Comments: Accepted at TACL, pre publication version

  8. arXiv:2208.12097  [pdf, other]

    cs.CL

    Training a T5 Using Lab-sized Resources

    Authors: Manuel R. Ciosici, Leon Derczynski

    Abstract: Training large neural language models on large datasets is resource- and time-intensive. These requirements create a barrier to entry, where those with fewer resources cannot build competitive models. This paper presents various techniques for making it possible to (a) train a large language model using resources that a modest research lab might have, and (b) train it in a reasonable amount of tim…

    Submitted 25 August, 2022; originally announced August 2022.

  9. arXiv:2208.06161  [pdf, other]

    cs.CL

    Sparse Probability of Agreement

    Authors: Jeppe Nørregaard, Leon Derczynski

    Abstract: Measuring inter-annotator agreement is important for annotation tasks, but many metrics require a fully-annotated set of data, where all annotators annotate all samples. We define Sparse Probability of Agreement, SPA, which estimates the probability of agreement when not all annotator-item-pairs are available. We show that under certain conditions, SPA is an unbiased estimator, and we provide mult…

    Submitted 24 February, 2023; v1 submitted 12 August, 2022; originally announced August 2022.

  10. arXiv:2206.08727  [pdf, other]

    cs.CL

    The ITU Faroese Pairs Dataset

    Authors: Leon Derczynski, Annika Solveig Hedegaard Isfeldt, Signhild Djurhuus

    Abstract: This article documents a dataset of sentence pairs between Faroese and Danish, produced at ITU Copenhagen. The data covers translation from both source languages, and is intended for use as training data for machine translation systems in this language pair.

    Submitted 17 June, 2022; originally announced June 2022.

  11. arXiv:2206.03720  [pdf]

    cs.LG cs.CL

    Set Interdependence Transformer: Set-to-Sequence Neural Networks for Permutation Learning and Structure Prediction

    Authors: Mateusz Jurewicz, Leon Derczynski

    Abstract: The task of learning to map an input set onto a permuted sequence of its elements is challenging for neural networks. Set-to-sequence problems occur in natural language processing, computer vision and structure prediction, where interactions between elements of large sets define the optimal output. Models must exhibit relational reasoning, handle varying cardinalities and manage combinatorial comp…

    Submitted 8 June, 2022; originally announced June 2022.

    Comments: Paper accepted for publication in the IJCAI-ECAI 2022 proceedings: https://www.ijcai.org/proceedings/

  12. arXiv:2205.03153  [pdf, other]

    cs.CL

    Bridging the Domain Gap for Stance Detection for the Zulu language

    Authors: Gcinizwe Dlamini, Imad Eddine Ibrahim Bekkouch, Adil Khan, Leon Derczynski

    Abstract: Misinformation has become a major concern in recent years given its spread across our information sources. In the past years, many NLP tasks have been introduced in this area, with some systems reaching good results on English language datasets. Existing AI based approaches for fighting misinformation in literature suggest automatic stance detection as an integral first step to success. Our p…

    Submitted 6 May, 2022; originally announced May 2022.

    Comments: accepted to Intellisys

  13. arXiv:2204.14256  [pdf, other]

    cs.CL

    Handling and Presenting Harmful Text in NLP Research

    Authors: Hannah Rose Kirk, Abeba Birhane, Bertie Vidgen, Leon Derczynski

    Abstract: Text data can pose a risk of harm. However, the risks are not fully understood, and how to handle, present, and discuss harmful text in a safe way remains an unresolved issue in the NLP community. We provide an analytical framework categorising harms on three axes: (1) the harm type (e.g., misinformation, hate speech or racial stereotypes); (2) whether a harm is sought as a feature of the…

    Submitted 24 February, 2023; v1 submitted 29 April, 2022; originally announced April 2022.

    Comments: in Findings of EMNLP 2022

  14. arXiv:2107.13592  [pdf]

    cs.CL

    Detecting Abusive Albanian

    Authors: Erida Nurce, Jorgel Keci, Leon Derczynski

    Abstract: The ever growing usage of social media in the recent years has had a direct impact on the increased presence of hate speech and offensive speech in online platforms. Research on effective detection of such content has mainly focused on English and a few other widespread languages, while the leftover majority fail to have the same work put into them and thus cannot benefit from the steady advanceme…

    Submitted 10 May, 2022; v1 submitted 28 July, 2021; originally announced July 2021.

  15. arXiv:2104.07951  [pdf, other]

    cs.CL

    Optimal Size-Performance Tradeoffs: Weighing PoS Tagger Models

    Authors: Magnus Jacobsen, Mikkel H. Sørensen, Leon Derczynski

    Abstract: Improvements in machine learning-based NLP performance are often presented with bigger models and more complex code. This presents a trade-off: better scores come at the cost of larger tools; bigger models tend to require more during training and inference time. We present multiple methods for measuring the size of a model, and for comparing this with the model's performance. In a case study over…

    Submitted 16 April, 2021; originally announced April 2021.

  16. arXiv:2012.06431  [pdf, other]

    cs.CL

    Discriminating Between Similar Nordic Languages

    Authors: René Haas, Leon Derczynski

    Abstract: Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely we will focus on discrimination between six Nordic languages: Danish,…

    Submitted 23 March, 2023; v1 submitted 11 December, 2020; originally announced December 2020.

    Comments: Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

  17. arXiv:2006.07237  [pdf, other]

    cs.LG cs.NE stat.ML

    Power Consumption Variation over Activation Functions

    Authors: Leon Derczynski

    Abstract: The power that machine learning models consume when making predictions can be affected by a model's architecture. This paper presents various estimates of power consumption for a range of different activation functions, a core factor in neural network model architecture design. Substantial differences in hardware performance exist between activation functions. This difference informs how power con…

    Submitted 12 June, 2020; originally announced June 2020.

  18. arXiv:2006.07235  [pdf, ps, other]

    cs.CL

    SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

    Authors: Marcos Zampieri, Preslav Nakov, Sara Rosenthal, Pepa Atanasova, Georgi Karadzhov, Hamdy Mubarak, Leon Derczynski, Zeses Pitenis, Çağrı Çöltekin

    Abstract: We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The task involves three subtasks corresponding to the hierarchical taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. The task featured five languages: English, Arabic, Danish, Greek, and Turkish for Subtask A. In addition, En…

    Submitted 30 September, 2020; v1 submitted 12 June, 2020; originally announced June 2020.

    Comments: Proceedings of the International Workshop on Semantic Evaluation (SemEval-2020)

    MSC Class: 68T50; 68T07 ACM Class: I.2.7

  19. Directions in Abusive Language Training Data: Garbage In, Garbage Out

    Authors: Bertie Vidgen, Leon Derczynski

    Abstract: Data-driven analysis and detection of abusive online content covers many different tasks, phenomena, contexts, and methodologies. This paper systematically reviews abusive language dataset creation and content in conjunction with an open website for cataloguing abusive language data. This collection of knowledge leads to a synthesis providing evidence-based recommendations for practitioners workin…

    Submitted 19 July, 2021; v1 submitted 3 April, 2020; originally announced April 2020.

    Comments: 26 pages, 5 figures

    Journal ref: PLoS ONE 15(12): e0243300

  20. The Rumour Mill: Making the Spread of Misinformation Explicit and Tangible

    Authors: Nanna Inie, Jeanette Falk Olesen, Leon Derczynski

    Abstract: Misinformation spread presents a technological and social threat to society. With the advance of AI-based language models, automatically generated texts have become difficult to identify and easy to create at scale. We present "The Rumour Mill", a playful art piece, designed as a commentary on the spread of rumours and automatically-generated misinformation. The mill is a tabletop interactive mach…

    Submitted 16 February, 2020; v1 submitted 11 February, 2020; originally announced February 2020.

    Comments: Accepted to CHI 2020 Interactivity

  21. arXiv:1908.04531  [pdf, ps, other]

    cs.CL

    Offensive Language and Hate Speech Detection for Danish

    Authors: Gudbjartur Ingi Sigurbergsson, Leon Derczynski

    Abstract: The presence of offensive language on social media platforms and the implications this poses is becoming a major concern in modern society. Given the enormous amount of content created every day, automatic methods are required to detect and deal with this type of content. Until now, most of the research has focused on solving the problem for the English language, while the problem is multilingual.…

    Submitted 23 March, 2023; v1 submitted 13 August, 2019; originally announced August 2019.

    Comments: Proceedings of the Twelfth Language Resources and Evaluation Conference

  22. arXiv:1906.11608  [pdf, ps, other]

    cs.CL

    Simple Natural Language Processing Tools for Danish

    Authors: Leon Derczynski

    Abstract: This technical note describes a set of baseline tools for automatic processing of Danish text. The tools are machine-learning based, using natural language processing models trained over previously annotated documents. They are maintained at ITU Copenhagen and will always be freely available.

    Submitted 26 July, 2019; v1 submitted 27 June, 2019; originally announced June 2019.

  23. arXiv:1809.06683  [pdf, other]

    cs.CL

    RumourEval 2019: Determining Rumour Veracity and Support for Rumours

    Authors: Genevieve Gorrell, Kalina Bontcheva, Leon Derczynski, Elena Kochkina, Maria Liakata, Arkaitz Zubiaga

    Abstract: This is the proposal for RumourEval-2019, which will run in early 2019 as part of that year's SemEval event. Since the first RumourEval shared task in 2017, interest in automated claim validation has greatly increased, as the dangers of "fake news" have become a mainstream concern. Yet automated support for rumour checking remains in its infancy. For this reason, it is important that a shared task…

    Submitted 18 September, 2018; originally announced September 2018.

  24. Stance Prediction for Russian: Data and Analysis

    Authors: Nikita Lozhnikov, Leon Derczynski, Manuel Mazzara

    Abstract: Stance detection is a critical component of rumour and fake news identification. It involves the extraction of the stance a particular author takes related to a given claim, both expressed in text. This paper investigates stance classification for Russian. It introduces a new dataset, RuStance, of Russian tweets and news comments from multiple sources, covering multiple stories, as well as text cl…

    Submitted 3 October, 2018; v1 submitted 5 September, 2018; originally announced September 2018.

  25. arXiv:1801.09633  [pdf, other]

    cs.CL

    Helping Crisis Responders Find the Informative Needle in the Tweet Haystack

    Authors: Leon Derczynski, Kenny Meesters, Kalina Bontcheva, Diana Maynard

    Abstract: Crisis responders are increasingly using social media, data and other digital sources of information to build a situational understanding of a crisis situation in order to design an effective response. However with the increased availability of such data, the challenge of identifying relevant information from it also increases. This paper presents a successful automatic approach to handling this p…

    Submitted 29 January, 2018; originally announced January 2018.

    Journal ref: Proc. 15th International Conference on Information Systems for Crisis Response and Management (ISCRAM), 2018, pp. 649-662. ISBN 9780692127605

  26. arXiv:1712.08349  [pdf, other]

    cs.CL cs.SI

    Tracking the Diffusion of Named Entities

    Authors: Leon Derczynski, Matthew Rowe

    Abstract: Existing studies of how information diffuses across social networks have thus far concentrated on analysing and recovering the spread of deterministic innovations such as URLs, hashtags, and group membership. However investigating how mentions of real-world entities appear and spread has yet to be explored, largely due to the computationally intractable nature of performing large-scale entity extr…

    Submitted 29 December, 2017; v1 submitted 22 December, 2017; originally announced December 2017.

  27. arXiv:1708.05286  [pdf, other]

    cs.CL

    Simple Open Stance Classification for Rumour Analysis

    Authors: Ahmet Aker, Leon Derczynski, Kalina Bontcheva

    Abstract: Stance classification determines the attitude, or stance, in a (typically short) text. The task has powerful applications, such as the detection of fake news or the automatic extraction of attitudes toward entities or events in the media. This paper describes a surprisingly simple and efficient classification approach to open stance classification in Twitter, for rumour and veracity classification…

    Submitted 14 September, 2017; v1 submitted 17 August, 2017; originally announced August 2017.

    Journal ref: In RANLP 2017

  28. arXiv:1704.05972  [pdf, ps, other]

    cs.CL cs.AI

    SemEval-2017 Task 8: RumourEval: Determining rumour veracity and support for rumours

    Authors: Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, Arkaitz Zubiaga

    Abstract: Media is full of false claims. Even Oxford Dictionaries named "post-truth" as the word of 2016. This makes it more important than ever to build systems that can identify the veracity of a story, and the kind of discourse there is around it. RumourEval is a SemEval shared task that aims to identify and handle rumours and reactions to them, in text. We present an annotation scheme, a large dataset c…

    Submitted 19 April, 2017; originally announced April 2017.

  29. arXiv:1701.02877  [pdf, other]

    cs.CL

    Generalisation in Named Entity Recognition: A Quantitative Analysis

    Authors: Isabelle Augenstein, Leon Derczynski, Kalina Bontcheva

    Abstract: Named Entity Recognition (NER) is a key NLP task, which is all the more challenging on Web and user-generated content with their diverse and continuously changing language. This paper aims to quantify how this diversity impacts state-of-the-art NER methods, by measuring named entity (NE) and context variability, feature sparsity, and their effects on precision and recall. In particular, our findin…

    Submitted 7 March, 2017; v1 submitted 11 January, 2017; originally announced January 2017.

    Comments: Preprint, accepted to Computer Speech and Language

  30. arXiv:1608.02094  [pdf, ps, other]

    cs.CL

    Desiderata for Vector-Space Word Representations

    Authors: Leon Derczynski

    Abstract: A plethora of vector-space representations for words is currently available, which is growing. These consist of fixed-length vectors containing real values, which represent a word. The result is a representation upon which the power of many conventional information processing and data mining techniques can be brought to bear, as long as the representations are designed with some forethought and fi…

    Submitted 6 August, 2016; originally announced August 2016.

  31. arXiv:1511.03088  [pdf, ps, other]

    cs.CL

    USFD: Twitter NER with Drift Compensation and Linked Data

    Authors: Leon Derczynski, Isabelle Augenstein, Kalina Bontcheva

    Abstract: This paper describes a pilot NER system for Twitter, comprising the USFD system entry to the W-NUT 2015 NER shared task. The goal is to correctly label entities in a tweet dataset, using an inventory of ten types. We employ structured learning, drawing on gazetteers taken from Linked Data, and on unsupervised clustering features, and attempting to compensate for stylistic and topic drift - a key c…

    Submitted 10 November, 2015; originally announced November 2015.

    Comments: Paper in ACL anthology: https://aclweb.org/anthology/W/W15/W15-4306.bib

    Journal ref: Proceedings of the ACL Workshop on Noisy User-generated Text (2015), pp. 48--53

  32. Analysis of Named Entity Recognition and Linking for Tweets

    Authors: Leon Derczynski, Diana Maynard, Giuseppe Rizzo, Marieke van Erp, Genevieve Gorrell, Raphaël Troncy, Johann Petrak, Kalina Bontcheva

    Abstract: Applying natural language processing for mining and intelligent information access to tweets (a form of microblog) is a challenging, emerging research area. Unlike carefully authored news text and other longer content, tweets pose a number of new challenges, due to their short, noisy, context-dependent, and dynamic nature. Information extraction from tweets is typically performed in a pipeline, co…

    Submitted 27 October, 2014; originally announced October 2014.

    Comments: 35 pages, accepted to journal Information Processing and Management

    Journal ref: Information Processing & Management 51 (2), 32-49, 2014

  33. arXiv:1403.4928  [pdf, ps, other]

    cs.CL

    Clinical TempEval

    Authors: Steven Bethard, Leon Derczynski, James Pustejovsky, Marc Verhagen

    Abstract: We describe the Clinical TempEval task which is currently in preparation for the SemEval-2015 evaluation exercise. This task involves identifying and describing events, times and the relations between them in clinical text. Six discrete subtasks are included, focusing on recognising mentions of times and events, describing those mentions for both entity types, identifying the relation between an e…

    Submitted 19 March, 2014; originally announced March 2014.

  34. arXiv:1304.7289  [pdf, ps, other]

    cs.CL

    TimeML-strict: clarifying temporal annotation

    Authors: Leon Derczynski, Hector Llorens, Naushad UzZaman

    Abstract: TimeML is an XML-based schema for annotating temporal information over discourse. The standard has been used to annotate a variety of resources and is followed by a number of tools, the creation of which constitute hundreds of thousands of man-hours of research work. However, the current state of resources is such that many are not valid, or do not produce valid output, or contain ambiguous or cus…

    Submitted 26 April, 2013; originally announced April 2013.

    ACM Class: I.2.7

  35. arXiv:1304.7157  [pdf, ps, other]

    cs.CL cs.IR

    Question Answering Against Very-Large Text Collections

    Authors: Leon Derczynski, Richard Shaw, Ben Solway, Jun Wang

    Abstract: Question answering involves developing methods to extract useful information from large collections of documents. This is done with specialised search engines such as Answer Finder. The aim of Answer Finder is to provide an answer to a question rather than a page listing related documents that may contain the correct answer. So, a question such as "How tall is the Eiffel Tower" would simply return…

    Submitted 26 April, 2013; originally announced April 2013.

    Journal ref: Master's theses, 2008, University of Sheffield

  36. arXiv:1206.5333  [pdf, ps, other]

    cs.CL

    TempEval-3: Evaluating Events, Time Expressions, and Temporal Relations

    Authors: Naushad UzZaman, Hector Llorens, James Allen, Leon Derczynski, Marc Verhagen, James Pustejovsky

    Abstract: We describe the TempEval-3 task which is currently in preparation for the SemEval-2013 evaluation exercise. The aim of TempEval is to advance research on temporal information processing. TempEval-3 follows on from previous TempEval events, incorporating: a three-part task structure covering event, temporal expression and temporal relation extraction; a larger dataset; and single overall task quali…

    Submitted 25 May, 2014; v1 submitted 22 June, 2012; originally announced June 2012.

  37. arXiv:1203.5084  [pdf, ps, other]

    cs.CL cs.IR

    A Data Driven Approach to Query Expansion in Question Answering

    Authors: Leon Derczynski, Jun Wang, Robert Gaizauskas, Mark A. Greenwood

    Abstract: Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer bearing documents are retrieved at low ranks for almost 40% of questi…

    Submitted 22 March, 2012; originally announced March 2012.

    Journal ref: Proc. IR4QA Workshop (2008) 34-41

  38. arXiv:1203.5076  [pdf, ps, other]

    cs.CL

    Massively Increasing TIMEX3 Resources: A Transduction Approach

    Authors: Leon Derczynski, Héctor Llorens, Estela Saquete

    Abstract: Automatic annotation of temporal expressions is a research challenge of great interest in the field of information extraction. Gold standard temporally-annotated resources are limited in size, which makes research using them difficult. Standards have also evolved over the past decade, so not all temporally annotated data is in the same format. We vastly increase available human-annotated temporal…

    Submitted 22 March, 2012; originally announced March 2012.

    Comments: Proc. LREC (2012)

    Journal ref: Proceedings of the 8th international conference on Language Resources and Evaluation (2012), pp. 3754-3761

  39. arXiv:1203.5073  [pdf, other]

    cs.CL

    USFD at KBP 2011: Entity Linking, Slot Filling and Temporal Bounding

    Authors: Amev Burman, Arun Jayapal, Sathish Kannan, Madhu Kavilikatta, Ayman Alhelbawy, Leon Derczynski, Robert Gaizauskas

    Abstract: This paper describes the University of Sheffield's entry in the 2011 TAC KBP entity linking and slot filling tasks. We chose to participate in the monolingual entity linking task, the monolingual slot filling task and the temporal slot filling tasks. We set out to build a framework for experimentation with knowledge base population. This framework was created, and applied to multiple KBP tasks. We…

    Submitted 22 March, 2012; originally announced March 2012.

    Comments: Proc. Text Analysis Conference (2011)

  40. arXiv:1203.5066  [pdf, ps, other]

    cs.CL

    A Corpus-based Study of Temporal Signals

    Authors: Leon Derczynski, Robert Gaizauskas

    Abstract: Automatic temporal ordering of events described in discourse has been of great interest in recent years. Event orderings are conveyed in text via various linguistic mechanisms including the use of expressions such as "before", "after" or "during" that explicitly assert a temporal relation -- temporal signals. In this paper, we investigate the role of temporal signals in temporal relation extracti…

    Submitted 22 March, 2012; originally announced March 2012.

    Comments: Proc. Corpus Linguistics (2011)

    Journal ref: Proceedings of the 6th Conference on Corpus Linguistics (2011), No. 197, pp. 1--8

  41. arXiv:1203.5062  [pdf, ps, other]

    cs.CL

    An Annotation Scheme for Reichenbach's Verbal Tense Structure

    Authors: Leon Derczynski, Robert Gaizauskas

    Abstract: In this paper we present RTMML, a markup language for the tenses of verbs and temporal relations between verbs. There is a richness to tense in language that is not fully captured by existing temporal annotation schemata. Following Reichenbach we present an analysis of tense in terms of abstract time points, with the aim of supporting automated processing of tense and temporal relations in languag…

    Submitted 22 March, 2012; originally announced March 2012.

    Journal ref: Proc. 6th Joint ACL-ISO Workshop on Interoperable Semantic Annotation (2011) 10-17

  42. arXiv:1203.5060  [pdf, ps, other]

    cs.CL

    USFD2: Annotating Temporal Expresions and TLINKs for TempEval-2

    Authors: Leon Derczynski, Robert Gaizauskas

    Abstract: We describe the University of Sheffield system used in the TempEval-2 challenge, USFD2. The challenge requires the automatic identification of temporal entities and relations in text. USFD2 identifies and anchors temporal expressions, and also attempts two of the four temporal relation assignment tasks. A rule-based system picks out and anchors temporal expressions, and a maximum entropy classifie…

    Submitted 22 March, 2012; originally announced March 2012.

    Comments: Part of TempEval-2

    Journal ref: Proc. 5th International Workshop on Semantic Evaluation (2010) 337-340

  43. arXiv:1203.5055  [pdf, ps, other]

    cs.CL

    Using Signals to Improve Automatic Classification of Temporal Relations

    Authors: Leon Derczynski, Robert Gaizauskas

    Abstract: Temporal information conveyed by language describes how the world around us changes through time. Events, durations and times are all temporal elements that can be viewed as intervals. These intervals are sometimes temporally related in text. Automatically determining the nature of such relations is a complex and unsolved problem. Some words can act as "signals" which suggest a temporal ordering b…

    Submitted 22 March, 2012; originally announced March 2012.

  44. arXiv:1203.5051  [pdf, ps, other]

    cs.CL

    Analysing Temporally Annotated Corpora with CAVaT

    Authors: Leon Derczynski, Robert Gaizauskas

    Abstract: We present CAVaT, a tool that performs Corpus Analysis and Validation for TimeML. CAVaT is an open source, modular checking utility for statistical analysis of features specific to temporally-annotated natural language corpora. It provides reporting, highlights salient links between a variety of general and time-specific linguistic features, and also validates a temporal annotation to ensure that…

    Submitted 22 March, 2012; originally announced March 2012.

    Journal ref: Proc. LREC (2010) 398-404