-
QuOTeS: Query-Oriented Technical Summarization
Authors:
Juan Ramirez-Orta,
Eduardo Xamena,
Ana Maguitman,
Axel J. Soto,
Flavia P. Zanoto,
Evangelos Milios
Abstract:
When writing an academic paper, researchers often spend considerable time reviewing and summarizing papers to extract relevant citations and data to compose the Introduction and Related Work sections. To address this problem, we propose QuOTeS, an interactive system designed to retrieve sentences related to a summary of the research from a collection of potential references and hence assist in the composition of new papers. QuOTeS integrates techniques from Query-Focused Extractive Summarization and High-Recall Information Retrieval to provide Interactive Query-Focused Summarization of scientific documents. To measure the performance of our system, we carried out a comprehensive user study where participants uploaded papers related to their research and evaluated the system in terms of its usability and the quality of the summaries it produces. The results show that QuOTeS provides a positive user experience and consistently produces query-focused summaries that are relevant, concise, and complete. We share the code of our system and the novel Query-Focused Summarization dataset collected during our experiments at https://github.com/jarobyte91/quotes.
Submitted 20 June, 2023;
originally announced June 2023.
-
Post-OCR Document Correction with large Ensembles of Character Sequence-to-Sequence Models
Authors:
Juan Ramirez-Orta,
Eduardo Xamena,
Ana Maguitman,
Evangelos Milios,
Axel J. Soto
Abstract:
In this paper, we propose a novel method based on character sequence-to-sequence models to correct documents already processed with Optical Character Recognition (OCR) systems. The main contribution of this paper is a set of strategies to accurately process strings much longer than the ones used to train the sequence model while being sample- and resource-efficient, supported by thorough experimentation. The strategy with the best performance involves splitting the input document into character n-grams and combining their individual corrections into the final output using a voting scheme that is equivalent to an ensemble of a large number of sequence models. We further investigate how to weigh the contributions from each member of this ensemble. We test our method on nine languages of the ICDAR 2019 competition on post-OCR text correction and achieve new state-of-the-art performance in five of them. Our code for post-OCR correction is shared at https://github.com/jarobyte91/post_ocr_correction.
Submitted 24 January, 2022; v1 submitted 13 September, 2021;
originally announced September 2021.
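The n-gram voting strategy described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the window size, and `toy_corrector` (a stand-in for a trained character sequence-to-sequence model) are all hypothetical, and the sketch assumes the corrector preserves the length of each window so that characters can be aligned by position.

```python
from collections import Counter
from typing import Callable, List, Tuple


def character_ngrams(text: str, n: int) -> List[Tuple[int, str]]:
    """Return (start_index, window) pairs for every overlapping character n-gram."""
    if len(text) <= n:
        return [(0, text)]
    return [(i, text[i:i + n]) for i in range(len(text) - n + 1)]


def correct_by_voting(text: str, correct_window: Callable[[str], str], n: int = 5) -> str:
    """Correct each n-gram independently, then vote per character position."""
    votes = [Counter() for _ in text]
    for start, window in character_ngrams(text, n):
        corrected = correct_window(window)
        # Assumption: the corrector preserves length, so characters align by position.
        for offset, char in enumerate(corrected[:len(window)]):
            votes[start + offset][char] += 1
    # Keep the most voted character; fall back to the input if a position got no votes.
    return "".join(
        counter.most_common(1)[0][0] if counter else original
        for counter, original in zip(votes, text)
    )


def toy_corrector(window: str) -> str:
    """Hypothetical stand-in for a trained model: fixes two common OCR confusions."""
    return window.replace("0", "o").replace("1", "l")


print(correct_by_voting("he11o w0rld", toy_corrector))
```

Because every character is covered by up to `n` overlapping windows, each position receives several independent votes, which is the sense in which the scheme behaves like an ensemble. Weighting the votes, as the abstract mentions, would replace the `+= 1` with a position-dependent weight.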
-
Assessing the behavior and performance of a supervised term-weighting technique for topic-based retrieval
Authors:
Mariano Maisonnave,
Fernando Delbianco,
Fernando Tohmé,
Ana Maguitman
Abstract:
This article analyses and evaluates FDDβ, a supervised term-weighting scheme that can be applied for query-term selection in topic-based retrieval. FDDβ weights terms based on two factors representing the descriptive and discriminating power of the terms with respect to the given topic. It then combines these two factors through an adjustable parameter that makes it possible to favor different aspects of retrieval, such as precision, recall, or a balance between both. The article makes the following contributions: (1) it presents an extensive analysis of the behavior of FDDβ as a function of its adjustable parameter; (2) it compares FDDβ against eighteen traditional and state-of-the-art weighting schemes; (3) it evaluates the performance of disjunctive queries built by combining terms selected using the analyzed methods; (4) it introduces a new public data set with news labeled as relevant or irrelevant to the economic domain. The analysis and evaluations are performed on three data sets: two well-known text data sets, namely 20 Newsgroups and Reuters-21578, and the newly released data set. It is possible to conclude that, despite its simplicity, FDDβ is competitive with state-of-the-art methods and has the important advantage of offering flexibility when adapting to specific task goals. The results also demonstrate that FDDβ offers a useful mechanism to explore different approaches to building complex queries.
Submitted 16 July, 2020; v1 submitted 13 July, 2020;
originally announced July 2020.
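The abstract above does not give the formula behind FDDβ, so the following is only a sketch of how two factors might be combined through an adjustable parameter. It assumes an F-beta-style weighted harmonic mean in which the descriptive score plays a recall-like role and the discriminating score a precision-like role; the function name and both input scores are hypothetical.

```python
def combine_descr_discr(descr: float, discr: float, beta: float = 1.0) -> float:
    """Combine a descriptive and a discriminating score, both in [0, 1].

    Assumption: an F-beta-style weighted harmonic mean, which may differ
    from the exact definition used in the paper.
    beta > 1 favors the descriptive (recall-like) score,
    beta < 1 favors the discriminating (precision-like) score.
    """
    beta_sq = beta * beta
    denominator = beta_sq * discr + descr
    if denominator == 0:
        return 0.0
    return (1 + beta_sq) * descr * discr / denominator


# A highly descriptive term versus a highly discriminating term,
# scored under a recall-oriented and a precision-oriented setting.
for beta in (2.0, 0.5):
    descriptive_term = combine_descr_discr(descr=1.0, discr=0.5, beta=beta)
    discriminating_term = combine_descr_discr(descr=0.5, discr=1.0, beta=beta)
    print(beta, round(descriptive_term, 3), round(discriminating_term, 3))
```

Under this assumed form, sweeping `beta` changes which of the two terms ranks higher, which mirrors the abstract's point that one parameter can shift query-term selection toward precision, recall, or a balance of both.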
-
Detecting Ongoing Events Using Contextual Word and Sentence Embeddings
Authors:
Mariano Maisonnave,
Fernando Delbianco,
Fernando Tohmé,
Ana Maguitman,
Evangelos Milios
Abstract:
This paper introduces the Ongoing Event Detection (OED) task, which is a specific Event Detection task where the goal is to detect ongoing event mentions only, as opposed to historical, future, hypothetical, or other forms of events that are neither fresh nor current. Any application that needs to extract structured information about ongoing events from unstructured texts can take advantage of an OED system. The main contributions of this paper are the following: (1) it introduces the OED task along with a dataset manually labeled for the task; (2) it presents the design and implementation of an RNN model for the task that uses BERT embeddings to define contextual word and contextual sentence embeddings as attributes, which to the best of our knowledge were never used before for detecting ongoing events in news; (3) it presents an extensive empirical evaluation that includes (i) the exploration of different architectures and hyperparameters, (ii) an ablation test to study the impact of each attribute, and (iii) a comparison with a replication of a state-of-the-art model. The results offer several insights into the importance of contextual embeddings and indicate that the proposed approach is effective in the OED task, outperforming the baseline models.
Submitted 5 February, 2021; v1 submitted 2 July, 2020;
originally announced July 2020.
-
Métodos para la Selección y el Ajuste de Características en el Problema de la Detección de Spam (Methods for Feature Selection and Adjustment in the Spam Detection Problem)
Authors:
Carlos M. Lorenzetti,
Rocío L. Cecchini,
Ana G. Maguitman,
András A. Benczúr
Abstract:
Email is used daily by millions of people to communicate around the globe, and it is a mission-critical application for many businesses. Over the last decade, unsolicited bulk email has become a major problem for email users. An overwhelming amount of spam is flowing into users' mailboxes daily. In 2004, an estimated 62% of all email was attributed to spam. Spam is not only frustrating for most email users; it also strains the IT infrastructure of organizations and costs businesses billions of dollars in lost productivity. In recent years, spam has evolved from an annoyance into a serious security threat, and is now a prime medium for phishing of sensitive information, as well as the spread of malicious software. This work presents a first approach to tackling the spam problem. We propose an algorithm that improves a classifier's results by adjusting its training set data. It improves the document's vocabulary representation by detecting good topic descriptors and discriminators.
Submitted 14 October, 2010; v1 submitted 1 June, 2010;
originally announced June 2010.
-
Learning Better Context Characterizations: An Intelligent Information Retrieval Approach
Authors:
Carlos M. Lorenzetti,
Ana G. Maguitman
Abstract:
This paper proposes an incremental method that can be used by an intelligent system to learn better descriptions of a thematic context. The method starts with a small number of terms selected from a simple description of the topic under analysis and uses this description as the initial search context. Using these terms, a set of queries is built and submitted to a search engine. New documents and terms are used to refine the learned vocabulary. Evaluations performed on a large number of topics indicate that the learned vocabulary is much more effective than the original one when constructing queries to retrieve relevant material.
Submitted 27 April, 2010; v1 submitted 20 April, 2010;
originally announced April 2010.
-
Uncovering protein interaction in abstracts and text using a novel linear model and word proximity networks
Authors:
Alaa Abi-Haidar,
Jasleen Kaur,
Ana G. Maguitman,
Predrag Radivojac,
Andreas Retchsteiner,
Karin Verspoor,
Zhi** Wang,
Luis M. Rocha
Abstract:
We participated in three of the protein-protein interaction subtasks of the Second BioCreative Challenge: classification of abstracts relevant for protein-protein interaction (IAS), discovery of protein pairs (IPS) and text passages characterizing protein interaction (ISS) in full text documents. We approached the abstract classification task with a novel, lightweight linear model inspired by spam-detection techniques, as well as an uncertainty-based integration scheme. We also used a Support Vector Machine and the Singular Value Decomposition on the same features for comparison purposes. Our approach to the full text subtasks (protein pair and passage identification) includes a feature expansion method based on word-proximity networks. Our approach to the abstract classification task (IAS) was among the top submissions for this task in terms of the measures of performance used in the challenge evaluation (accuracy, F-score and AUC). We also report on a web-tool we produced using our approach: the Protein Interaction Abstract Relevance Evaluator (PIARE). Our approach to the full text tasks resulted in one of the highest recall rates as well as mean reciprocal rank of correct passages. Our approach to abstract classification shows that a simple linear model, using relatively few features, is capable of generalizing and uncovering the conceptual nature of protein-protein interaction from the bibliome. Since the novel approach is based on a very lightweight linear model, it can be easily ported and applied to similar problems. In full text problems, the expansion of word features with word-proximity networks is shown to be useful, though the need for some improvements is discussed.
Submitted 4 December, 2008;
originally announced December 2008.
-
Decoding the structure of the WWW: facts versus sampling biases
Authors:
M. Angeles Serrano,
Ana Maguitman,
Marian Boguna,
Santo Fortunato,
Alessandro Vespignani
Abstract:
Understanding the immense and intricate topological structure of the World Wide Web (WWW) is a major scientific and technological challenge. This has been tackled recently by characterizing the properties of its representative graphs, in which vertices and directed edges are identified with web pages and hyperlinks, respectively. Data gathered in large-scale crawls have been analyzed by several groups, resulting in a general picture of the WWW that encompasses many of the complex properties typical of rapidly evolving networks. In this paper, we report a detailed statistical analysis of the topological properties of four different WWW graphs obtained with different crawlers. We find that, despite the very large size of the samples, the statistical measures characterizing these graphs differ quantitatively, and in some cases qualitatively, depending on the domain analyzed and the crawl used for gathering the data. This raises the issue of sampling biases and structural differences between Web crawls that might induce properties not representative of the actual global underlying graph. In order to provide a more accurate characterization of the Web graph and identify observables that are clearly discriminating with respect to the sampling process, we study the behavior of degree-degree correlation functions and the statistics of reciprocal connections. The latter appears to capture the relevant correlations of the WWW graph and carry most of the topological information of the Web. The analysis of this quantity is also of major interest in relation to the navigability and searchability of the Web.
Submitted 14 February, 2006; v1 submitted 8 November, 2005;
originally announced November 2005.
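The abstract above refers to the statistics of reciprocal connections without defining them. As a rough illustration only, one simple reciprocity measure for a directed graph is the fraction of hyperlinks whose reverse link also exists; the function name and the toy edge list below are hypothetical, and the paper may use a different or more refined statistic.

```python
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def link_reciprocity(edges: Iterable[Edge]) -> float:
    """Fraction of distinct directed edges (u, v) for which (v, u) also exists.

    Self-loops and duplicate edges are ignored. This is one common
    definition, assumed here for illustration.
    """
    edge_set = {(source, target) for source, target in edges if source != target}
    if not edge_set:
        return 0.0
    reciprocated = sum(1 for source, target in edge_set if (target, source) in edge_set)
    return reciprocated / len(edge_set)


# Toy web graph: pages a and b link to each other; the other links are one-way.
toy_links = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
print(link_reciprocity(toy_links))
```

Comparing such a value across crawls of different domains is the kind of check the abstract describes when it asks whether an observable is sensitive to the sampling process.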