Search | arXiv e-print repository

doi 10.1007/s10844-022-00751-3

Exploiting Hierarchical Dependence Structures for Unsupervised Rank Fusion in Information Retrieval

Authors: J. Hermosillo-Valadez, E. Morales-González, F. Fernández-Reyes, M. Montes-y-Gómez, J. Fuentes-Pacheco, J. M. Rendón-Mancha

Abstract: The goal of rank fusion in information retrieval (IR) is to deliver a single output list from multiple search results. Improving performance by combining the outputs of various IR systems is a challenging task. A central point is the fact that many non-obvious factors are involved in the estimation of relevance, inducing nonlinear interrelations between the data. The ability to model complex depen… ▽ More The goal of rank fusion in information retrieval (IR) is to deliver a single output list from multiple search results. Improving performance by combining the outputs of various IR systems is a challenging task. A central point is the fact that many non-obvious factors are involved in the estimation of relevance, inducing nonlinear interrelations between the data. The ability to model complex dependency relationships between random variables has become increasingly popular in the realm of information retrieval, and the need to further explore these dependencies for data fusion has been recently acknowledged. Copulas provide a framework to separate the dependence structure from the margins. Inspired by the theory of copulas, we propose a new unsupervised, dynamic, nonlinear, rank fusion method, based on a nested composition of non-algebraic function pairs. The dependence structure of the model is tailored by leveraging query-document correlations on a per-query basis. We experimented with three topic sets over CLEF corpora fusing 3 and 6 retrieval systems, comparing our method against the CombMNZ technique and other nonlinear unsupervised strategies. The experiments show that our fusion approach improves performance under explicit conditions, providing insight about the circumstances under which linear fusion techniques have comparable performance to nonlinear methods. △ Less

Submitted 10 August, 2022; originally announced August 2022.

Comments: 17 pages, 5 figures

ACM Class: H.3

Journal ref: Journal of Intelligent Information Systems 2022

arXiv:2103.01334 [pdf, other]

Deep Bag-of-Sub-Emotions for Depression Detection in Social Media

Authors: Juan S. Lara, Mario Ezra Aragon, Fabio A. Gonzalez, Manuel Montes-y-Gomez

Abstract: This paper presents the Deep Bag-of-Sub-Emotions (DeepBoSE), a novel deep learning model for depression detection in social media. The model is formulated such that it internally computes a differentiable Bag-of-Features (BoF) representation that incorporates emotional information. This is achieved by a reinterpretation of classical weighting schemes like term frequency-inverse document frequency… ▽ More This paper presents the Deep Bag-of-Sub-Emotions (DeepBoSE), a novel deep learning model for depression detection in social media. The model is formulated such that it internally computes a differentiable Bag-of-Features (BoF) representation that incorporates emotional information. This is achieved by a reinterpretation of classical weighting schemes like term frequency-inverse document frequency into probabilistic deep learning operations. An important advantage of the proposed method is that it can be trained under the transfer learning paradigm, which is useful to enhance conventional BoF models that cannot be directly integrated into deep learning architectures. Experiments were performed in the eRisk17 and eRisk18 datasets for the depression detection task; results show that DeepBoSE outperforms conventional BoF representations and it is competitive with the state of the art, achieving a F1-score over the positive class of 0.64 in eRisk17 and 0.65 in eRisk18. △ Less

Submitted 1 March, 2021; originally announced March 2021.

arXiv:1912.09322 [pdf, other]

PySS3: A Python package implementing a novel text classifier with visualization tools for Explainable AI

Authors: Sergio G. Burdisso, Marcelo Errecalde, Manuel Montes-y-Gómez

Abstract: A recently introduced text classifier, called SS3, has obtained state-of-the-art performance on the CLEF's eRisk tasks. SS3 was created to deal with risk detection over text streams and, therefore, not only supports incremental training and classification but also can visually explain its rationale. However, little attention has been paid to the potential use of SS3 as a general classifier. We bel… ▽ More A recently introduced text classifier, called SS3, has obtained state-of-the-art performance on the CLEF's eRisk tasks. SS3 was created to deal with risk detection over text streams and, therefore, not only supports incremental training and classification but also can visually explain its rationale. However, little attention has been paid to the potential use of SS3 as a general classifier. We believe this could be due to the unavailability of an open-source implementation of SS3. In this work, we introduce PySS3, a package that implements SS3 and also comes with visualization tools that allow researchers to deploy robust, explainable, and trusty machine learning models for text classification. △ Less

Submitted 17 July, 2020; v1 submitted 19 December, 2019; originally announced December 2019.

arXiv:1911.06147 [pdf, other]

doi 10.1016/j.patrec.2020.07.001

t-SS3: a text classifier with dynamic n-grams for early risk detection over text streams

Authors: Sergio G. Burdisso, Marcelo Errecalde, Manuel Montes-y-Gómez

Abstract: A recently introduced classifier, called SS3, has shown to be well suited to deal with early risk detection (ERD) problems on text streams. It obtained state-of-the-art performance on early depression and anorexia detection on Reddit in the CLEF's eRisk open tasks. SS3 was created to deal with ERD problems naturally since: it supports incremental training and classification over text streams, and… ▽ More A recently introduced classifier, called SS3, has shown to be well suited to deal with early risk detection (ERD) problems on text streams. It obtained state-of-the-art performance on early depression and anorexia detection on Reddit in the CLEF's eRisk open tasks. SS3 was created to deal with ERD problems naturally since: it supports incremental training and classification over text streams, and it can visually explain its rationale. However, SS3 processes the input using a bag-of-word model lacking the ability to recognize important word sequences. This aspect could negatively affect the classification performance and also reduces the descriptiveness of visual explanations. In the standard document classification field, it is very common to use word n-grams to try to overcome some of these limitations. Unfortunately, when working with text streams, using n-grams is not trivial since the system must learn and recognize which n-grams are important "on the fly". This paper introduces t-SS3, an extension of SS3 that allows it to recognize useful patterns over text streams dynamically. We evaluated our model in the eRisk 2017 and 2018 tasks on early depression and anorexia detection. Experimental results suggest that t-SS3 is able to improve both current results and the richness of visual explanations. △ Less

Submitted 6 May, 2020; v1 submitted 11 November, 2019; originally announced November 2019.

Comments: Highlights: (*) A classifier that is able to dynamically learn and recognize important word n-grams. (*) A novel text classifier having the ability to visually explain its rationale. (*) Support for incremental learning and text classification over streams. (*) Efficient model for addressing early risk detection problems

Journal ref: Pattern Recognition Letters, Elsevier, 2020

arXiv:1906.04836 [pdf, other]

Unmasking Bias in News

Authors: Javier Sánchez-Junquera, Paolo Rosso, Manuel Montes-y-Gómez, Simone Paolo Ponzetto

Abstract: We present experiments on detecting hyperpartisanship in news using a 'masking' method that allows us to assess the role of style vs. content for the task at hand. Our results corroborate previous research on this task in that topic related features yield better results than stylistic ones. We additionally show that competitive results can be achieved by simply including higher-length n-grams, whi… ▽ More We present experiments on detecting hyperpartisanship in news using a 'masking' method that allows us to assess the role of style vs. content for the task at hand. Our results corroborate previous research on this task in that topic related features yield better results than stylistic ones. We additionally show that competitive results can be achieved by simply including higher-length n-grams, which suggests the need to develop more challenging datasets and tasks that address implicit and more subtle forms of bias. △ Less

Submitted 11 June, 2019; originally announced June 2019.

arXiv:1905.08780 [pdf, other]

doi 10.3233/JIFS-179033

A Comparative Analysis of Distributional Term Representations for Author Profiling in Social Media

Authors: Miguel Á. Álvarez-Carmona, Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Luis Villaseñor-Pienda

Abstract: Author Profiling (AP) aims at predicting specific characteristics from a group of authors by analyzing their written documents. Many research has been focused on determining suitable features for modeling writing patterns from authors. Reported results indicate that content-based features continue to be the most relevant and discriminant features for solving this task. Thus, in this paper, we pres… ▽ More Author Profiling (AP) aims at predicting specific characteristics from a group of authors by analyzing their written documents. Many research has been focused on determining suitable features for modeling writing patterns from authors. Reported results indicate that content-based features continue to be the most relevant and discriminant features for solving this task. Thus, in this paper, we present a thorough analysis regarding the appropriateness of different distributional term representations (DTR) for the AP task. In this regard, we introduce a novel framework for supervised AP using these representations and, supported on it. We approach a comparative analysis of representations such as DOR, TCOR, SSR, and word2vec in the AP problem. We also compare the performance of the DTRs against classic approaches including popular topic-based methods. The obtained results indicate that DTRs are suitable for solving the AP task in social media domains as they achieve competitive results while providing meaningful interpretability. △ Less

Submitted 21 May, 2019; originally announced May 2019.

Journal ref: Journal of Intelligent & Fuzzy Systems, vol. 36, no. 5, pp. 4857-4868, 2019

arXiv:1905.08772 [pdf, other]

doi 10.1016/j.eswa.2019.05.023

A Text Classification Framework for Simple and Effective Early Depression Detection Over Social Media Streams

Authors: Sergio G. Burdisso, Marcelo Errecalde, Manuel Montes-y-Gómez

Abstract: With the rise of the Internet, there is a growing need to build intelligent systems that are capable of efficiently dealing with early risk detection (ERD) problems on social media, such as early depression detection, early rumor detection or identification of sexual predators. These systems, nowadays mostly based on machine learning techniques, must be able to deal with data streams since users p… ▽ More With the rise of the Internet, there is a growing need to build intelligent systems that are capable of efficiently dealing with early risk detection (ERD) problems on social media, such as early depression detection, early rumor detection or identification of sexual predators. These systems, nowadays mostly based on machine learning techniques, must be able to deal with data streams since users provide their data over time. In addition, these systems must be able to decide when the processed data is sufficient to actually classify users. Moreover, since ERD tasks involve risky decisions by which people's lives could be affected, such systems must also be able to justify their decisions. However, most standard and state-of-the-art supervised machine learning models are not well suited to deal with this scenario. This is due to the fact that they either act as black boxes or do not support incremental classification/learning. In this paper we introduce SS3, a novel supervised learning model for text classification that naturally supports these aspects. SS3 was designed to be used as a general framework to deal with ERD problems. We evaluated our model on the CLEF's eRisk2017 pilot task on early depression detection. Most of the 30 contributions submitted to this competition used state-of-the-art methods. Experimental results show that our classifier was able to outperform these models and standard classifiers, despite being less computationally expensive and having the ability to explain its rationale. △ Less

Submitted 17 April, 2024; v1 submitted 18 May, 2019; originally announced May 2019.

Comments: Highlights: (*) A novel text classifier having the ability to visually explain its rationale; (*) Domain-independent classification that does not require feature engineering; (*) Support for incremental learning and text classification over streams; (*) Efficient framework for addressing early risk detection problems; (*) State-of-the-art performance on early depression detection task

Journal ref: 18 May 2019, Volume 133, Expert Systems With Applications, Elsevier

arXiv:1805.11611 [pdf, other]

doi 10.3233/JIFS-169483

Semantically-informed distance and similarity measures for paraphrase plagiarism identification

Authors: Miguel A. Álvarez-Carmona, Marc Franco-Salvador, Esaú Villatoro-Tello, Manuel Montes-y-Gómez, Paolo Rosso, Luis Villaseñor-Pineda

Abstract: Paraphrase plagiarism identification represents a very complex task given that plagiarized texts are intentionally modified through several rewording techniques. Accordingly, this paper introduces two new measures for evaluating the relatedness of two given texts: a semantically-informed similarity measure and a semantically-informed edit distance. Both measures are able to extract semantic inform… ▽ More Paraphrase plagiarism identification represents a very complex task given that plagiarized texts are intentionally modified through several rewording techniques. Accordingly, this paper introduces two new measures for evaluating the relatedness of two given texts: a semantically-informed similarity measure and a semantically-informed edit distance. Both measures are able to extract semantic information from either an external resource or a distributed representation of words, resulting in informative features for training a supervised classifier for detecting paraphrase plagiarism. Obtained results indicate that the proposed metrics are consistently good in detecting different types of paraphrase plagiarism. In addition, results are very competitive against state-of-the art methods having the advantage of representing a much more simple but equally effective solution. △ Less

Submitted 29 May, 2018; originally announced May 2018.

Journal ref: Journal of Intelligent & Fuzzy Systems, vol. 34, no. 5, pp. 2983-2990, 2018

arXiv:1805.11166 [pdf, other]

doi 10.3233/JIFS-169497

A visual approach for age and gender identification on Twitter

Authors: Miguel A. Alvarez-Carmona, Luis Pellegrin, Manuel Montes-y-Gómez, Fernando Sánchez-Vega, Hugo Jair Escalante, A. Pastor López-Monroy, Luis Villaseñor-Pineda, Esaú Villatoro-Tello

Abstract: The goal of Author Profiling (AP) is to identify demographic aspects (e.g., age, gender) from a given set of authors by analyzing their written texts. Recently, the AP task has gained interest in many problems related to computer forensics, psychology, marketing, but specially in those related with social media exploitation. As known, social media data is shared through a wide range of modalities… ▽ More The goal of Author Profiling (AP) is to identify demographic aspects (e.g., age, gender) from a given set of authors by analyzing their written texts. Recently, the AP task has gained interest in many problems related to computer forensics, psychology, marketing, but specially in those related with social media exploitation. As known, social media data is shared through a wide range of modalities (e.g., text, images and audio), representing valuable information to be exploited for extracting valuable insights from users. Nevertheless, most of the current work in AP using social media data has been devoted to analyze textual information only, and there are very few works that have started exploring the gender identification using visual information. Contrastingly, this paper focuses in exploiting the visual modality to perform both age and gender identification in social media, specifically in Twitter. Our goal is to evaluate the pertinence of using visual information in solving the AP task. Accordingly, we have extended the Twitter corpus from PAN 2014, incorporating posted images from all the users, making a distinction between tweeted and retweeted images. Performed experiments provide interesting evidence on the usefulness of visual information in comparison with traditional textual representations for the AP task. △ Less

Submitted 28 May, 2018; originally announced May 2018.

Journal ref: Miguel A. Alvarez-Carmona, Luis Pellegrin et al. A visual approach for age and gender identification on Twitter. Journal of Intelligent and Fuzzy Systems, vol. 34, no. 5, pp. 3133-3145, 2018

arXiv:1805.09746 [pdf, other]

Letting Emotions Flow: Success Prediction by Modeling the Flow of Emotions in Books

Authors: Suraj Maharjan, Sudipta Kar, Manuel Montes-y-Gomez, Fabio A. Gonzalez, Thamar Solorio

Abstract: Books have the power to make us feel happiness, sadness, pain, surprise, or sorrow. An author's dexterity in the use of these emotions captivates readers and makes it difficult for them to put the book down. In this paper, we model the flow of emotions over a book using recurrent neural networks and quantify its usefulness in predicting success in books. We obtained the best weighted F1-score of 6… ▽ More Books have the power to make us feel happiness, sadness, pain, surprise, or sorrow. An author's dexterity in the use of these emotions captivates readers and makes it difficult for them to put the book down. In this paper, we model the flow of emotions over a book using recurrent neural networks and quantify its usefulness in predicting success in books. We obtained the best weighted F1-score of 69% for predicting books' success in a multitask setting (simultaneously predicting success and genre of books). △ Less

Submitted 24 May, 2018; v1 submitted 24 May, 2018; originally announced May 2018.

Comments: NAACL 2018, 7 pages

arXiv:1702.01992 [pdf, other]

Gated Multimodal Units for Information Fusion

Authors: John Arevalo, Thamar Solorio, Manuel Montes-y-Gómez, Fabio A. González

Abstract: This paper presents a novel model for multimodal learning based on gated neural networks. The Gated Multimodal Unit (GMU) model is intended to be used as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities. The GMU learns to decide how modalities influence the activation of the unit using… ▽ More This paper presents a novel model for multimodal learning based on gated neural networks. The Gated Multimodal Unit (GMU) model is intended to be used as an internal unit in a neural network architecture whose purpose is to find an intermediate representation based on a combination of data from different modalities. The GMU learns to decide how modalities influence the activation of the unit using multiplicative gates. It was evaluated on a multilabel scenario for genre classification of movies using the plot and the poster. The GMU improved the macro f-score performance of single-modality approaches and outperformed other fusion strategies, including mixture of experts models. Along with this work, the MM-IMDb dataset is released which, to the best of our knowledge, is the largest publicly available multimodal dataset for genre prediction on movies. △ Less

Submitted 7 February, 2017; originally announced February 2017.

arXiv:1509.06053 [pdf, ps, other]

Early text classification: a Naive solution

Authors: Hugo Jair Escalante, Manuel Montes-y-Gómez, Luis Villaseñor-Pineda, Marcelo Luis Errecalde

Abstract: Text classification is a widely studied problem, and it can be considered solved for some domains and under certain circumstances. There are scenarios, however, that have received little or no attention at all, despite its relevance and applicability. One of such scenarios is early text classification, where one needs to know the category of a document by using partial information only. A document… ▽ More Text classification is a widely studied problem, and it can be considered solved for some domains and under certain circumstances. There are scenarios, however, that have received little or no attention at all, despite its relevance and applicability. One of such scenarios is early text classification, where one needs to know the category of a document by using partial information only. A document is processed as a sequence of terms, and the goal is to devise a method that can make predictions as fast as possible. The importance of this variant of the text classification problem is evident in domains like sexual predator detection, where one wants to identify an offender as early as possible. This paper analyzes the suitability of the standard naive Bayes classifier for approaching this problem. Specifically, we assess its performance when classifying documents after seeing an increasingly number of terms. A simple modification to the standard naive Bayes implementation allows us to make predictions with partial information. To the best of our knowledge naive Bayes has not been used for this purpose before. Throughout an extensive experimental evaluation we show the effectiveness of the classifier for early text classification. What is more, we show that this simple solution is very competitive when compared with state of the art methodologies that are more elaborated. We foresee our work will pave the way for the development of more effective early text classification techniques based in the naive Bayes formulation. △ Less

Submitted 20 September, 2015; originally announced September 2015.

Comments: 8 pages, preprint submitted to SDM'16

arXiv:1410.0640 [pdf, ps, other]

Term-Weighting Learning via Genetic Programming for Text Classification

Authors: Hugo Jair Escalante, Mauricio A. García-Limón, Alicia Morales-Reyes, Mario Graff, Manuel Montes-y-Gómez, Eduardo F. Morales

Abstract: This paper describes a novel approach to learning term-weighting schemes (TWSs) in the context of text classification. In text mining a TWS determines the way in which documents will be represented in a vector space model, before applying a classifier. Whereas acceptable performance has been obtained with standard TWSs (e.g., Boolean and term-frequency schemes), the definition of TWSs has been tra… ▽ More This paper describes a novel approach to learning term-weighting schemes (TWSs) in the context of text classification. In text mining a TWS determines the way in which documents will be represented in a vector space model, before applying a classifier. Whereas acceptable performance has been obtained with standard TWSs (e.g., Boolean and term-frequency schemes), the definition of TWSs has been traditionally an art. Further, it is still a difficult task to determine what is the best TWS for a particular problem and it is not clear yet, whether better schemes, than those currently available, can be generated by combining known TWS. We propose in this article a genetic program that aims at learning effective TWSs that can improve the performance of current schemes in text classification. The genetic program learns how to combine a set of basic units to give rise to discriminative TWSs. We report an extensive experimental study comprising data sets from thematic and non-thematic text classification as well as from image classification. Our study shows the validity of the proposed method; in fact, we show that TWSs learned with the genetic program outperform traditional schemes and other TWSs proposed in recent works. Further, we show that TWSs learned from a specific domain can be effectively used for other tasks. △ Less

Submitted 6 October, 2014; v1 submitted 2 October, 2014; originally announced October 2014.

MSC Class: 68T50; 68T10

Showing 1–13 of 13 results for author: Montes-y-Gómez, M