Search | arXiv e-print repository

O2D2: Out-Of-Distribution Detector to Capture Undecidable Trials in Authorship Verification

Authors: Benedikt Boenninghoff, Robert M. Nickel, Dorothea Kolossa

Abstract: The PAN 2021 authorship verification (AV) challenge is part of a three-year strategy, moving from a cross-topic/closed-set AV task to a cross-topic/open-set AV task over a collection of fanfiction texts. In this work, we present a novel hybrid neural-probabilistic framework that is designed to tackle the challenges of the 2021 task. Our system is based on our 2020 winning submission, with updates… ▽ More The PAN 2021 authorship verification (AV) challenge is part of a three-year strategy, moving from a cross-topic/closed-set AV task to a cross-topic/open-set AV task over a collection of fanfiction texts. In this work, we present a novel hybrid neural-probabilistic framework that is designed to tackle the challenges of the 2021 task. Our system is based on our 2020 winning submission, with updates to significantly reduce sensitivities to topical variations and to further improve the system's calibration by means of an uncertainty-adaptation layer. Our framework additionally includes an out-of-distribution detector (O2D2) for defining non-responses. Our proposed system outperformed all other systems that participated in the PAN 2021 AV task. △ Less

Submitted 30 July, 2021; v1 submitted 30 June, 2021; originally announced June 2021.

Comments: PAN@CLEF 2021

arXiv:2106.11196 [pdf, other]

Self-Calibrating Neural-Probabilistic Model for Authorship Verification Under Covariate Shift

Authors: Benedikt Boenninghoff, Dorothea Kolossa, Robert M. Nickel

Abstract: We are addressing two fundamental problems in authorship verification (AV): Topic variability and miscalibration. Variations in the topic of two disputed texts are a major cause of error for most AV systems. In addition, it is observed that the underlying probability estimates produced by deep learning AV mechanisms oftentimes do not match the actual case counts in the respective training data. As… ▽ More We are addressing two fundamental problems in authorship verification (AV): Topic variability and miscalibration. Variations in the topic of two disputed texts are a major cause of error for most AV systems. In addition, it is observed that the underlying probability estimates produced by deep learning AV mechanisms oftentimes do not match the actual case counts in the respective training data. As such, probability estimates are poorly calibrated. We are expanding our framework from PAN 2020 to include Bayes factor scoring (BFS) and an uncertainty adaptation layer (UAL) to address both problems. Experiments with the 2020/21 PAN AV shared task data show that the proposed method significantly reduces sensitivities to topical variations and significantly improves the system's calibration. △ Less

Submitted 21 June, 2021; originally announced June 2021.

Comments: 12th International Conference of the CLEF Association, 2021

arXiv:2103.01173 [pdf, ps, other]

Unsupervised Classification of Voiced Speech and Pitch Tracking Using Forward-Backward Kalman Filtering

Authors: Benedikt Boenninghoff, Robert M. Nickel, Steffen Zeiler, Dorothea Kolossa

Abstract: The detection of voiced speech, the estimation of the fundamental frequency, and the tracking of pitch values over time are crucial subtasks for a variety of speech processing techniques. Many different algorithms have been developed for each of the three subtasks. We present a new algorithm that integrates the three subtasks into a single procedure. The algorithm can be applied to pre-recorded sp… ▽ More The detection of voiced speech, the estimation of the fundamental frequency, and the tracking of pitch values over time are crucial subtasks for a variety of speech processing techniques. Many different algorithms have been developed for each of the three subtasks. We present a new algorithm that integrates the three subtasks into a single procedure. The algorithm can be applied to pre-recorded speech utterances in the presence of considerable amounts of background noise. We combine a collection of standard metrics, such as the zero-crossing rate, for example, to formulate an unsupervised voicing classifier. The estimation of pitch values is accomplished with a hybrid autocorrelation-based technique. We propose a forward-backward Kalman filter to smooth the estimated pitch contour. In experiments, we are able to show that the proposed method compares favorably with current, state-of-the-art pitch detection algorithms. △ Less

Submitted 1 March, 2021; originally announced March 2021.

Comments: Speech Communication; 12. ITG Symposium, 5-7 Oct. 2016

arXiv:2008.10105 [pdf, other]

Deep Bayes Factor Scoring for Authorship Verification

Authors: Benedikt Boenninghoff, Julian Rupp, Robert M. Nickel, Dorothea Kolossa

Abstract: The PAN 2020 authorship verification (AV) challenge focuses on a cross-topic/closed-set AV task over a collection of fanfiction texts. Fanfiction is a fan-written extension of a storyline in which a so-called fandom topic describes the principal subject of the document. The data provided in the PAN 2020 AV task is quite challenging because authors of texts across multiple/different fandom topics a… ▽ More The PAN 2020 authorship verification (AV) challenge focuses on a cross-topic/closed-set AV task over a collection of fanfiction texts. Fanfiction is a fan-written extension of a storyline in which a so-called fandom topic describes the principal subject of the document. The data provided in the PAN 2020 AV task is quite challenging because authors of texts across multiple/different fandom topics are included. In this work, we present a hierarchical fusion of two well-known approaches into a single end-to-end learning procedure: A deep metric learning framework at the bottom aims to learn a pseudo-metric that maps a document of variable length onto a fixed-sized feature vector. At the top, we incorporate a probabilistic layer to perform Bayes factor scoring in the learned metric space. We also provide text preprocessing strategies to deal with the cross-topic issue. △ Less

Submitted 23 August, 2020; originally announced August 2020.

Comments: CLEF 2020 Labs and Workshops, Notebook Papers, September 2020. CEUR-WS.org

arXiv:2005.13930 [pdf, other]

Variational Autoencoder with Embedded Student-$t$ Mixture Model for Authorship Attribution

Authors: Benedikt Boenninghoff, Steffen Zeiler, Robert M. Nickel, Dorothea Kolossa

Abstract: Traditional computational authorship attribution describes a classification task in a closed-set scenario. Given a finite set of candidate authors and corresponding labeled texts, the objective is to determine which of the authors has written another set of anonymous or disputed texts. In this work, we propose a probabilistic autoencoding framework to deal with this supervised classification task.… ▽ More Traditional computational authorship attribution describes a classification task in a closed-set scenario. Given a finite set of candidate authors and corresponding labeled texts, the objective is to determine which of the authors has written another set of anonymous or disputed texts. In this work, we propose a probabilistic autoencoding framework to deal with this supervised classification task. More precisely, we are extending a variational autoencoder (VAE) with embedded Gaussian mixture model to a Student-$t$ mixture model. Autoencoders have had tremendous success in learning latent representations. However, existing VAEs are currently still bound by limitations imposed by the assumed Gaussianity of the underlying probability distributions in the latent space. In this work, we are extending the Gaussian model for the VAE to a Student-$t$ model, which allows for an independent control of the "heaviness" of the respective tails of the implied probability densities. Experiments over an Amazon review dataset indicate superior performance of the proposed method. △ Less

Submitted 28 May, 2020; originally announced May 2020.

Comments: Preprint

arXiv:1910.08144 [pdf, ps, other]

Explainable Authorship Verification in Social Media via Attention-based Similarity Learning

Authors: Benedikt Boenninghoff, Steffen Hessler, Dorothea Kolossa, Robert M. Nickel

Abstract: Authorship verification is the task of analyzing the linguistic patterns of two or more texts to determine whether they were written by the same author or not. The analysis is traditionally performed by experts who consider linguistic features, which include spelling mistakes, grammatical inconsistencies, and stylistics for example. Machine learning algorithms, on the other hand, can be trained to… ▽ More Authorship verification is the task of analyzing the linguistic patterns of two or more texts to determine whether they were written by the same author or not. The analysis is traditionally performed by experts who consider linguistic features, which include spelling mistakes, grammatical inconsistencies, and stylistics for example. Machine learning algorithms, on the other hand, can be trained to accomplish the same, but have traditionally relied on so-called stylometric features. The disadvantage of such features is that their reliability is greatly diminished for short and topically varied social media texts. In this interdisciplinary work, we propose a substantial extension of a recently published hierarchical Siamese neural network approach, with which it is feasible to learn neural features and to visualize the decision-making process. For this purpose, a new large-scale corpus of short Amazon reviews for text comparison research is compiled and we show that the Siamese network topologies outperform state-of-the-art approaches that were built up on stylometric features. Our linguistic analysis of the internal attention weights of the network shows that the proposed method is indeed able to latch on to some traditional linguistic categories. △ Less

Submitted 19 November, 2019; v1 submitted 17 October, 2019; originally announced October 2019.

Comments: Accepted for 2019 IEEE International Conference on Big Data (IEEE Big Data 2019)

arXiv:1908.07844 [pdf, ps, other]

doi 10.1109/ICASSP.2019.8683405

Similarity Learning for Authorship Verification in Social Media

Authors: Benedikt Boenninghoff, Robert M. Nickel, Steffen Zeiler, Dorothea Kolossa

Abstract: Authorship verification tries to answer the question if two documents with unknown authors were written by the same author or not. A range of successful technical approaches has been proposed for this task, many of which are based on traditional linguistic features such as n-grams. These algorithms achieve good results for certain types of written documents like books and novels. Forensic authorsh… ▽ More Authorship verification tries to answer the question if two documents with unknown authors were written by the same author or not. A range of successful technical approaches has been proposed for this task, many of which are based on traditional linguistic features such as n-grams. These algorithms achieve good results for certain types of written documents like books and novels. Forensic authorship verification for social media, however, is a much more challenging task since messages tend to be relatively short, with a large variety of different genres and topics. At this point, traditional methods based on features like n-grams have had limited success. In this work, we propose a new neural network topology for similarity learning that significantly improves the performance on the author verification task with such challenging data sets. △ Less

Submitted 20 August, 2019; originally announced August 2019.

Comments: 5 pages, 3 figures, 1 table, presented on ICASSP 2019 in Brighton, UK

Showing 1–7 of 7 results for author: Nickel, R M