doi 10.1145/3397271.3401306

How UMass-FSD Inadvertently Leverages Temporal Bias

Abstract: First Story Detection describes the task of identifying new events in a stream of documents. The UMass-FSD system is known for its strong performance in First Story Detection competitions. Recently, it has been frequently used as a high accuracy baseline in research publications. We are the first to discover that UMass-FSD inadvertently leverages temporal bias. Interestingly, the discovered bias contrasts previously known biases and performs significantly better. Our analysis reveals an increased contribution of temporally distant documents, resulting from an unusual way of handling incremental term statistics. We show that this form of temporal bias is also applicable to other well-known First Story Detection systems, where it improves the detection accuracy. To provide a more generalizable conclusion and demonstrate that the observed bias is not only an artefact of a particular implementation, we present a model that intentionally leverages a bias on temporal distance. Our model significantly improves the detection effectiveness of state-of-the-art First Story Detection systems.

Submitted 2 August, 2022; originally announced August 2022.

Comments: Temporal Bias, First Story Detection, Topic Detection and Tracking, UMass-FSD, LSH-FSD

Journal ref: SIGIR 20, July 2020

arXiv:2208.01340 [pdf, ps, other]

doi 10.1145/3209978.3210101

Parameterizing Kterm Hashing

Authors: Dominik Wurzer, Yumeng Qin

Abstract: Kterm Hashing provides an innovative approach to novelty detection on massive data streams. Previous research focused on maximizing the efficiency of Kterm Hashing and succeeded in scaling First Story Detection to Twitter-size data streams without sacrificing detection accuracy. In this paper, we focus on improving the effectiveness of Kterm Hashing. Traditionally, all kterms are considered equally important when calculating a document's degree of novelty with respect to the past. We believe that certain kterms are more important than others and hypothesize that uniform kterm weights are sub-optimal for determining novelty in data streams. To validate our hypothesis, we parameterize Kterm Hashing by assigning weights to kterms based on their characteristics. Our experiments apply Kterm Hashing in a First Story Detection setting and reveal that parameterized Kterm Hashing can surpass state-of-the-art detection accuracy and significantly outperform the uniformly weighted approach.

Submitted 2 August, 2022; originally announced August 2022.

Comments: Kterm Hashing, Novelty Detection, First Story Detection

Journal ref: SIGIR 18, July 2018, Ann Arbor, MI, USA

arXiv:1701.01737 [pdf, other]

Spotting Information biases in Chinese and Western Media

Authors: Dominik Wurzer, Yumeng Qin

Abstract: Newswire and Social Media are the major sources of information in our time. While the topical demographic of Western Media has been the subject of studies in the past, less is known about Chinese Media. In this paper, we apply event detection and tracking technology to examine the information overlap and differences between Chinese and Western - Traditional Media and Social Media. Our experiments reveal a biased interest of China towards the West, which becomes particularly apparent when comparing the interest in celebrities.

Submitted 6 January, 2017; originally announced January 2017.

arXiv:1611.06322 [pdf, other]

Spotting Rumors via Novelty Detection

Authors: Yumeng Qin, Dominik Wurzer, Victor Lavrenko, Cunchen Tang

Abstract: Rumour detection is hard because the most accurate systems operate retrospectively, only recognizing rumours once they have collected repeated signals. By then the rumours might have already spread and caused harm. We introduce a new category of features based on novelty, tailored to detect rumours early on. To compensate for the absence of repeated signals, we make use of news wire as an additional data source. Unconfirmed (novel) information with respect to the news articles is considered as an indication of rumours. Additionally, we introduce pseudo feedback, which assumes that documents that are similar to previous rumours are more likely to also be a rumour. Comparison with other real-time approaches shows that novelty based features in conjunction with pseudo feedback perform significantly better when detecting rumours instantly after their publication.

Submitted 19 November, 2016; originally announced November 2016.

arXiv:1607.02641 [pdf, other]

Randomised Relevance Model

Authors: Dominik Wurzer, Miles Osborne, Victor Lavrenko

Abstract: Relevance Models are well-known retrieval models and capable of producing competitive results. However, because they use query expansion they can be very slow. We address this slowness by incorporating two variants of locality sensitive hashing (LSH) into the query expansion process. Results on two document collections suggest that we can obtain large reductions in the amount of work, with a small reduction in effectiveness. Our approach is shown to be additive when pruning query terms.

Submitted 9 July, 2016; originally announced July 2016.

Comments: Information Retrieval, Query Expansion, Locality Sensitive Hashing, Randomized Algorithm, Relevance Model

Showing 1–5 of 5 results for author: Wurzer, D