
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling

Authors: Marília Costa Rosendo Silva, Felipe Alves Siqueira, João Pedro Mantovani Tarrega, João Vitor Pataca Beinotti, Augusto Sousa Nunes, Miguel de Mattos Gardini, Vinícius Adolfo Pereira da Silva, Nádia Félix Felipe da Silva, André Carlos Ponce de Leon Ferreira de Carvalho

Abstract: Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since similar procedures have different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works.

Submitted 2 August, 2022; originally announced August 2022.

ACM Class: I.2; I.2.7; I.5.3

arXiv:2108.12214 [pdf, other]

Machine Learning for Performance Prediction of Spark Cloud Applications

Authors: Alexandre Maros, Fabricio Murai, Ana Paula Couto da Silva, Jussara M. Almeida, Marco Lattuada, Eugenio Gianniti, Marjan Hosseini, Danilo Ardagna

Abstract: Big data applications and analytics are employed in many sectors for a variety of goals: improving customer satisfaction, predicting market behavior or improving processes in public health. These applications consist of complex software stacks that are often run on cloud systems. Predicting execution times is important for estimating the cost of cloud services and for effectively managing the underlying resources at runtime. Machine Learning (ML), providing black box solutions to model the relationship between application performance and system configuration without requiring detailed knowledge of the system, has become a popular way of predicting the performance of big data applications. We investigate the cost-benefits of using supervised ML models for predicting the performance of applications on Spark, one of today's most widely used frameworks for big data analysis. We compare our approach with Ernest (an ML-based technique proposed in the literature by the Spark inventors) on a range of scenarios, application workloads, and cloud system configurations. Our experiments show that Ernest can accurately estimate the performance of very regular applications, but it fails when applications exhibit more irregular patterns and/or when extrapolating on bigger data set sizes. Results show that our models match or exceed Ernest's performance, sometimes enabling us to reduce the prediction error from 126-187% to only 5-19%.

Submitted 27 August, 2021; originally announced August 2021.

Comments: Published in 2019 IEEE 12th International Conference on Cloud Computing (CLOUD)

ACM Class: B.8.2; I.2

arXiv:2108.09778 [pdf]

"Sharing Wisdoms from the East": Developing a Native Theory of ICT4D Using Grounded Theory Methodology (GTM) -- Experience from Timor-Leste

Authors: Abel Pires da Silva

Abstract: There have been repeated calls made for theory-building studies in ICT4D research to solidify the existence of this research field. However, theory-building studies are not yet common, even though ICT4D as a research domain is a promising venue to develop native and indigenous theories. To this end, this paper outlines a theory-building study in ICT4D, based on the author's experience in developing a mid-range theory called 'Cultivating-Sustainability' of E-government projects, a native mid-range theory of ICT4D. The paper synthesizes the GTM literature and provides a step-by-step illustration of GTM use in practice for research students and early career ICT4D academics. It introduces the key strategies and principles of GTM, such as the theoretical sampling strategy, the constant comparison strategy, the concept-emergent principle, and the use of literature throughout the study process. It then discusses the steps involved in the data collection and analysis process to develop a theory using case studies as sources of empirical data; it concludes with a discussion on using the strategies and principles in the three case studies. It is expected that this paper contributes to the diversification of research methodology, particularly to our collective quest for developing native and indigenous theories in the ICT4D research domain.

Submitted 22 August, 2021; originally announced August 2021.

Comments: In proceedings of the 1st Virtual Conference on Implications of Information and Digital Technologies for Development, 2021

arXiv:2107.04702 [pdf]

Um Metodo para Busca Automatica de Redes Neurais Artificiais (A Method for Automatic Search of Artificial Neural Networks)

Authors: Anderson P. da Silva, Teresa B. Ludermir, Leandro M. Almeida

Abstract: This paper describes a method that automatically searches Artificial Neural Networks using Cellular Genetic Algorithms. The main difference between this method and a common genetic algorithm is the use of a cellular automaton capable of providing the location for individuals, reducing the possibility of local minima in search space. This method employs an evolutionary search for simultaneous choices of initial weights, transfer functions, architectures and learning rules. Experimental results have shown that the developed method can find compact, efficient networks with a satisfactory generalization power and with shorter training times when compared to other methods found in the literature.

Submitted 9 July, 2021; originally announced July 2021.

Comments: 13 pages, in Portuguese, 4 figures, 2 tables

arXiv:2006.15401 [pdf, other]

You Shall not Pass: Avoiding Spurious Paths in Shortest-Path Based Centralities in Multidimensional Complex Networks

Authors: Klaus Wehmuth, Artur Ziviani, Leonardo Chinelate Costa, Ana Paula Couto da Silva, Alex Borges Vieira

Abstract: In complex network analysis, centralities based on shortest paths, such as betweenness and closeness, are widely used. More recently, many complex systems are being represented by time-varying, multilayer, and time-varying multilayer networks, i.e. multidimensional (or high order) networks. Nevertheless, it is well-known that the aggregation process may create spurious paths on the aggregated view of such multidimensional (high order) networks. Consequently, these spurious paths may then cause shortest-path based centrality metrics to produce incorrect results, thus undermining the network centrality analysis. In this context, we propose a method able to avoid taking into account spurious paths when computing centralities based on shortest paths in multidimensional (or high order) networks. Our method is based on MultiAspect Graphs (MAG) to represent the multidimensional networks and we show that well-known centrality algorithms can be straightforwardly adapted to the MAG environment. Moreover, we show that, by using this MAG representation, pitfalls usually associated with spurious paths resulting from aggregation in multidimensional networks can be avoided at the time of the aggregation process. As a result, shortest-path based centralities are assured to be computed correctly for multidimensional networks, without taking into account spurious paths that could otherwise lead to incorrect results.
We also present a case study that shows the impact of spurious paths in the computing of shortest paths and consequently of shortest-path based centralities, such as betweenness and closeness, thus illustrating the importance of this contribution.

Submitted 19 August, 2020; v1 submitted 27 June, 2020; originally announced June 2020.

Comments: 17 pages, 6 figures

arXiv:2005.07473 [pdf, other]

doi 10.1016/j.future.2021.07.014

Predicting User Emotional Tone in Mental Disorder Online Communities

Authors: Bárbara Silveira, Henrique S. Silva, Fabricio Murai, Ana Paula Couto da Silva

Abstract: In recent years, Online Social Networks have become an important medium for people who suffer from mental disorders to share moments of hardship, and receive emotional and informational support. In this work, we analyze how discussions in Reddit communities related to mental disorders can help improve the health conditions of their users. Using the emotional tone of users' writing as a proxy for emotional state, we uncover relationships between user interactions and state changes. First, we observe that authors of negative posts often write rosier comments after engaging in discussions, indicating that users' emotional state can improve due to social support. Second, we build models based on SOTA text embedding techniques and RNNs to predict shifts in emotional tone. This differs from most related work, which focuses primarily on detecting mental disorders from user activity. We demonstrate the feasibility of accurately predicting the users' reactions to the interactions experienced in these platforms, and present some examples which illustrate that the models are correctly capturing the effects of comments on the author's emotional tone. Our models hold promising implications for interventions to provide support for people struggling with mental illnesses.

Submitted 27 July, 2021; v1 submitted 15 May, 2020; originally announced May 2020.

Comments: 8 pages, 3 figures, 3 tables

ACM Class: J.3; I.2.7

Journal ref: Future Generation Computer Systems, Volume 125, 2021, Pages 641-651, ISSN 0167-739X

arXiv:1904.11719 [pdf, other]

doi 10.1145/3342220.3343657

Towards Understanding Political Interactions on Instagram

Authors: Martino Trevisan, Luca Vassio, Idilio Drago, Marco Mellia, Fabricio Murai, Flavio Figueiredo, Ana Paula Couto da Silva, Jussara M. Almeida

Abstract: Online Social Networks (OSNs) allow personalities and companies to communicate directly with the public, bypassing filters of traditional media. As people rely on OSNs to stay up-to-date, the political debate has moved online too. We witness the sudden explosion of harsh political debates and the dissemination of rumours in OSNs. Identifying such behaviour requires a deep understanding of how people interact via OSNs during political debates. We present a preliminary study of interactions in a popular OSN, namely Instagram. We take Italy as a case study in the period before the 2019 European Elections. We observe the activity of top Italian Instagram profiles in different categories: politics, music, sport and show. We record their posts for more than two months, tracking "likes" and comments from users. Results suggest that profiles of politicians attract markedly different interactions than other categories. People tend to comment more, with longer comments, debating for longer time, with a large number of replies, most of which are not explicitly solicited. Moreover, comments tend to come from a small group of very active users. Finally, we witness substantial differences when comparing profiles of different parties.

Submitted 4 May, 2021; v1 submitted 26 April, 2019; originally announced April 2019.

Comments: 5 pages, 8 figures, Proceedings of the 30th ACM Conference on Hypertext and Social Media, https://dl.acm.org/doi/10.1145/3342220.3343657

Journal ref: HT19: Proceedings of the 30th ACM Conference on Hypertext and Social Media. September 2019. Pages 247-251. Association for Computing Machinery

arXiv:1803.03497 [pdf, other]

doi 10.5753/wperformance.2018.3343

Modelos de Resposta para Experimentos Randomizados em Redes Sociais de Larga Escala (Response Models for Randomized Experiments in Large-Scale Social Networks)

Authors: Francisco Galuppo Azevedo, Bruno Demattos Nogueira, Fabricio Murai, Ana Paula Couto da Silva

Abstract: A/B tests are randomized experiments frequently used by companies that offer services on the Web for assessing the impact of new features. During an experiment, each user is randomly redirected to one of two versions of the website, called treatments. Several response models were proposed to describe the behavior of a user in a social network website, where the treatment assigned to her neighbors must be taken into account. However, there is no consensus as to which model should be applied to a given dataset. In this work, we propose a new response model, derive theoretical limits for the estimation error of several models, and obtain empirical results for cases where the response model was misspecified.

Submitted 9 March, 2018; originally announced March 2018.

Comments: 15 pages, in Portuguese, 2 figures, submitted to SBC WPerformance 2018

arXiv:1504.00241 [pdf, other]

doi 10.1142/S021952591550023X

Time Centrality in Dynamic Complex Networks

Authors: Eduardo Chinelate Costa, Alex Borges Vieira, Klaus Wehmuth, Artur Ziviani, Ana Paula Couto da Silva

Abstract: There is an ever-increasing interest in investigating dynamics in time-varying graphs (TVGs). Nevertheless, so far, the notion of centrality in TVG scenarios usually refers to metrics that assess the relative importance of nodes along the temporal evolution of the dynamic complex network. For some TVG scenarios, however, more important than identifying the central nodes under a given node centrality definition is identifying the key time instants for taking certain actions. In this paper, we thus introduce and investigate the notion of time centrality in TVGs. Analogously to node centrality, time centrality evaluates the relative importance of time instants in dynamic complex networks. In this context, we present two time centrality metrics related to diffusion processes. We evaluate the two defined metrics using both a real-world dataset representing an in-person contact dynamic network and a synthetically generated randomized TVG. We validate the concept of time centrality showing that diffusion starting at the best classified time instants (i.e. the most central ones), according to our metrics, can perform a faster and more efficient diffusion process.

Submitted 5 September, 2015; v1 submitted 1 April, 2015; originally announced April 2015.

Journal ref: Advances in Complex Systems (ACS), vol. 18, no. 07n08, November & December 2015

arXiv:1202.0024 [pdf, other]

doi 10.1088/1742-5468/2012/07/P07005

Predicting epidemic outbreak from individual features of the spreaders

Authors: Renato Aparecido Pimentel da Silva, Matheus Palhares Viana, Luciano da Fontoura Costa

Abstract: Knowing which individuals can be more efficient in spreading a pathogen throughout a determinate environment is a fundamental question in disease control. Indeed, over the last years the spread of epidemic diseases and its relationship with the topology of the involved system have been a recurrent topic in complex network theory, taking into account both network models and real-world data. In this paper we explore possible correlations between the heterogeneous spread of an epidemic disease governed by the susceptible-infected-recovered (SIR) model, and several attributes of the originating vertices, considering Erdős-Rényi (ER), Barabási-Albert (BA) and random geometric graphs (RGG), as well as a real case study, the US Air Transportation Network that comprises the US 500 busiest airports along with inter-connections. Initially, the heterogeneity of the spreading is achieved considering the RGG networks, in which we analytically derive an expression for the distribution of the spreading rates among the established contacts, by assuming that such rates decay exponentially with the distance that separates the individuals. Such distribution is also considered for the ER and BA models, where we observe topological effects on the correlations. In the case of the airport network, the spreading rates are empirically defined, assumed to be directly proportional to the seat availability.
Among both the theoretical and the real networks considered, we observe a high correlation between the total epidemic prevalence and the degree, as well as the strength and the accessibility of the epidemic sources. For attributes such as the betweenness centrality and the $k$-shell index, however, the correlation depends on the topology considered.

Submitted 15 June, 2012; v1 submitted 31 January, 2012; originally announced February 2012.

Comments: 10 pages, 6 figures

Showing 1–10 of 10 results for author: da Silva, A P