
Noisy Data Visualization using Functional Data Analysis

Authors: Haozhe Chen, Andres Felipe Duque Correa, Guy Wolf, Kevin R. Moon

Abstract: Data visualization via dimensionality reduction is an important tool in exploratory data analysis. However, when the data are noisy, many existing methods fail to capture the underlying structure of the data. The method called Empirical Intrinsic Geometry (EIG) was previously proposed for performing dimensionality reduction on high dimensional dynamical processes while theoretically eliminating all noise. However, implementing EIG in practice requires the construction of high-dimensional histograms, which suffer from the curse of dimensionality. Here we propose a new data visualization method called Functional Information Geometry (FIG) for dynamical processes that adapts the EIG framework while using approaches from functional data analysis to mitigate the curse of dimensionality. We experimentally demonstrate that the resulting method outperforms a variant of EIG designed for visualization in terms of capturing the true structure, hyperparameter robustness, and computational speed. We then use our method to visualize EEG brain measurements of sleep activity.

Submitted 5 June, 2024; originally announced June 2024.

arXiv:2207.03093 [pdf, other]

Backpropagation on Dynamical Networks

Authors: Eugene Tan, Débora Corrêa, Thomas Stemler, Michael Small

Abstract: Dynamical networks are versatile models that can describe a variety of behaviours such as synchronisation and feedback. However, applying these models in real world contexts is difficult as prior information pertaining to the connectivity structure or local dynamics is often unknown and must be inferred from time series observations of network states. Additionally, the influence of coupling interactions between nodes further complicates the isolation of local node dynamics. Given the architectural similarities between dynamical networks and recurrent neural networks (RNN), we propose a network inference method based on the backpropagation through time (BPTT) algorithm commonly used to train recurrent neural networks. This method aims to simultaneously infer both the connectivity structure and local node dynamics purely from observation of node states. An approximation of local node dynamics is first constructed using a neural network. This is alternated with an adapted BPTT algorithm to regress corresponding network weights by minimising prediction errors of the dynamical network based on the previously constructed local models until convergence is achieved. This method was found to be successful in identifying the connectivity structure for coupled networks of Lorenz, Chua and FitzHugh-Nagumo oscillators. Freerun prediction performance with the resulting local models and weights was found to be comparable to the true system with noisy initial conditions. The method is also extended to non-conventional network couplings such as asymmetric negative coupling.

Submitted 7 February, 2023; v1 submitted 7 July, 2022; originally announced July 2022.

arXiv:2107.03190 [pdf, ps, other]

Nested Counterfactual Identification from Arbitrary Surrogate Experiments

Authors: Juan D Correa, Sanghack Lee, Elias Bareinboim

Abstract: The Ladder of Causation describes three qualitatively different types of activities an agent may be interested in engaging in, namely, seeing (observational), doing (interventional), and imagining (counterfactual) (Pearl and Mackenzie, 2018). The inferential challenge imposed by the causal hierarchy is that data is collected by an agent observing or intervening in a system (layers 1 and 2), while its goal may be to understand what would have happened had it taken a different course of action, contrary to what factually ended up happening (layer 3). While there exists a solid understanding of the conditions under which cross-layer inferences are allowed from observations to interventions, the results are somewhat scarcer when targeting counterfactual quantities. In this paper, we study the identification of nested counterfactuals from an arbitrary combination of observations and experiments. Specifically, building on a more explicit definition of nested counterfactuals, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones. For instance, applications in mediation and fairness analysis usually evoke notions of direct, indirect, and spurious effects, which naturally require nesting. Second, we introduce a sufficient and necessary graphical condition for counterfactual identification from an arbitrary combination of observational and experimental distributions. Lastly, we develop an efficient and complete algorithm for identifying nested counterfactuals; failure of the algorithm to return an expression for a query implies that the query is not identifiable.

Submitted 12 September, 2021; v1 submitted 7 July, 2021; originally announced July 2021.

arXiv:1706.04254 [pdf, ps, other]

Automatic Localization of Deep Stimulation Electrodes Using Trajectory-based Segmentation Approach

Authors: Roger Gomez Nieto, Andres Marino Alvarez Meza, Julian David Echeverry Correa, Alvaro Angel Orozco Gutierrez

Abstract: Parkinson's disease (PD) is a degenerative condition of the nervous system, which manifests itself primarily as muscle stiffness, hypokinesia, bradykinesia, and tremor. In patients suffering from advanced stages of PD, Deep Brain Stimulation neurosurgery (DBS) is the best alternative to medical treatment, especially when they become tolerant to the drugs. This surgery produces neuronal activity as a result of electrical stimulation, whose quantification is known as Volume of Tissue Activated (VTA). To locate the VTA correctly in the cerebral volume space, one must know the exact location of the tip of the DBS electrodes, as well as their spatial projection. In this paper, we automatically locate DBS electrodes using a threshold-based medical imaging segmentation methodology, determining the optimal value of this threshold adaptively. The proposed methodology allows the localization of DBS electrodes in Computed Tomography (CT) images, with high noise tolerance, using automatic threshold detection methods.

Submitted 13 June, 2017; originally announced June 2017.

Comments: 13 pages, 5 figures

arXiv:1603.07709 [pdf, ps, other]

Analyzing the Targets of Hate in Online Social Media

Authors: Leandro Silva, Mainack Mondal, Denzil Correa, Fabricio Benevenuto, Ingmar Weber

Abstract: Social media systems allow Internet users a congenial platform to freely express their thoughts and opinions. Although this property represents incredible and unique communication opportunities, it also brings along important challenges. Online hate speech is an archetypal example of such challenges. Despite its magnitude and scale, there is a significant gap in understanding the nature of hate speech on social media. In this paper, we provide the first-of-its-kind systematic large-scale measurement study of the main targets of hate speech in online social media. To do that, we gather traces from two social media systems: Whisper and Twitter. We then develop and validate a methodology to identify hate speech on both these systems. Our results identify online hate speech forms and offer a broader understanding of the phenomenon, providing directions for prevention and detection approaches.

Submitted 24 March, 2016; originally announced March 2016.

Comments: Short paper, 4 pages, 4 tables

arXiv:1412.6853 [pdf, other]

Musical elements in the discrete-time representation of sound

Authors: Renato Fabbri, Vilson Vieira da Silva Junior, Antônio Carlos Silvano Pessotti, Débora Cristina Corrêa, Osvaldo N. Oliveira Jr

Abstract: The representation of basic elements of music in terms of discrete audio signals is often used in software for musical creation and design. Nevertheless, there is no unified approach that relates these elements to the discrete samples of digitized sound. In this article, each musical element is related by equations and algorithms to the discrete-time samples of sounds, and each of these relations is implemented in scripts within a software toolbox, referred to as MASS (Music and Audio in Sample Sequences). The fundamental element, the musical note with duration, volume, pitch and timbre, is related quantitatively to characteristics of the digital signal. Internal variations of a note, such as tremolos, vibratos and spectral fluctuations, are also considered, which enables the synthesis of notes inspired by real instruments and new sonorities. With this representation of notes, resources are provided for the generation of higher scale musical structures, such as rhythmic meter, pitch intervals and cycles. This framework enables precise and trustworthy scientific experiments and data sonification, and is useful for education and art. The efficacy of MASS is confirmed by the synthesis of small musical pieces using basic notes, elaborated notes and notes in music, which reflects the organization of the toolbox and thus of this article. It is possible to synthesize whole albums through collage of the scripts and settings specified by the user. With the open source paradigm, the toolbox can be promptly scrutinized, expanded in co-authorship processes and used with freedom by musicians, engineers and other interested parties. In fact, MASS has already been employed for diverse purposes which include music production, artistic presentations, psychoacoustic experiments and computer language diffusion where the appeal of audiovisual artifacts is exploited for education.

Submitted 26 October, 2017; v1 submitted 21 December, 2014; originally announced December 2014.

Comments: A software toolbox, a Python Package, musical pieces and further documents are in: https://github.com/ttm/mass
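As a rough illustration of what relating a note's duration, volume and pitch to discrete-time samples can look like, here is a minimal sketch of a plain sinusoidal note. It is not taken from the MASS toolbox: the function name `synthesize_note`, its parameters, and the 44100 Hz default sample rate are all assumptions made for this example.

```python
import math

def synthesize_note(freq_hz, duration_s, amplitude=0.5, sample_rate=44100):
    """Return the samples of a sine-wave note (hypothetical helper, not the MASS API).

    Duration fixes the number of samples, pitch fixes the frequency of the
    sinusoid, and volume is a constant amplitude scaling.
    """
    n_samples = int(round(duration_s * sample_rate))
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n_samples)
    ]

# A 440 Hz note lasting half a second.
note = synthesize_note(440.0, 0.5)
```

Timbre and the internal variations mentioned in the abstract (tremolo, vibrato) would need a richer waveform and time-varying amplitude or frequency, which this sketch deliberately leaves out.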

arXiv:1401.5163 [pdf]

doi 10.7321/jscse.v3.n3.10

Multi-hop Energy-efficient Control for Heterogeneous Wireless Sensor Networks Using Fuzzy Logic

Authors: Alexandre M Melo Silva, Christiano C Maciel, Suelene do Carmo Correa

Abstract: Wireless Sensor Networks (WSN) have severe energy constraints imposed by limited capacity of the internal battery of sensor nodes. These restrictions stimulate the development of energy-efficient strategies aimed at increasing the period of stability and lifetime of these networks. In this paper, we propose a centralized control to elect more appropriate Cluster Heads, assuming three levels of heterogeneity and multi-hop communication between Cluster Heads. The centralized control uses the k-means algorithm, responsible for the division of clusters, and Fuzzy Logic to elect the Cluster Head and select the best route of communication. The study results indicate that the proposed approach can increase the period of stability and lifetime in WSN.

Submitted 20 January, 2014; originally announced January 2014.

Comments: 2013 JSCSE

Report number: SCSE'13 2013

Journal ref: JSCSE Vol. 3, No. 3, 2013

arXiv:1401.0480 [pdf, other]

Chaff from the Wheat : Characterization and Modeling of Deleted Questions on Stack Overflow

Authors: Denzil Correa, Ashish Sureka

Abstract: Stack Overflow is the most popular CQA for programmers on the web with 2.05M users, 5.1M questions and 9.4M answers. Stack Overflow has explicit, detailed guidelines on how to post questions and an ebullient moderation community. Despite these precise communications and safeguards, questions posted on Stack Overflow can be extremely off topic or very poor in quality. Such questions can be deleted from Stack Overflow at the discretion of experienced community members and moderators. We present the first study of deleted questions on Stack Overflow. We divide our study into two parts: (i) Characterization of deleted questions over approx. 5 years (2008-2013) of data, (ii) Prediction of deletion at the time of question creation. Our characterization study reveals multiple insights on question deletion phenomena. We observe a significant increase in the number of deleted questions over time. We find that it takes substantial time to vote a question to be deleted but once voted, the community takes swift action. We also see that question authors delete their questions to salvage reputation points. We notice some instances of accidental deletion of good quality questions but such questions are voted back to be undeleted quickly. We discover a pyramidal structure of question quality on Stack Overflow and find that deleted questions lie at the bottom (lowest quality) of the pyramid. We also build a predictive model to detect the deletion of a question at creation time. We experiment with 47 features based on User Profile, Community Generated, Question Content and Syntactic style and report an accuracy of 66%. Our feature analysis reveals that all four categories of features are important for the prediction task. Our findings reveal important suggestions for content quality maintenance on community based question answering websites.

Submitted 2 January, 2014; originally announced January 2014.

Comments: 11 pages, Pre-print

arXiv:1307.7291 [pdf, ps, other]

Fit or Unfit : Analysis and Prediction of 'Closed Questions' on Stack Overflow

Authors: Denzil Correa, Ashish Sureka

Abstract: Stack Overflow is widely regarded as the most popular Community driven Question Answering (CQA) website for programmers. Questions posted on Stack Overflow which are not related to programming topics are marked as 'closed' by experienced users and community moderators. A question can be 'closed' for five reasons: duplicate, off-topic, subjective, not a real question and too localized. In this work, we present the first study of 'closed' questions in Stack Overflow. We download 4 years of publicly available data which contains 3.4 Million questions. We first analyze and characterize the complete set of 0.1 Million 'closed' questions. Next, we use a machine learning framework and build a predictive model to identify a 'closed' question at the time of question creation. One of our key findings is that despite being marked as 'closed', subjective questions contain high information value and are very popular with the users. We observe an increasing trend in the percentage of closed questions over time and find that this increase is positively correlated to the number of newly registered users. In addition, we also see a decrease in community participation to mark a 'closed' question which has led to an increase in moderation job time. We also find that questions closed with the Duplicate and Off Topic labels are relatively more prone to reputation gaming. For the 'closed' question prediction task, we make use of multiple genres of feature sets based on user profile, community process, textual style and question content. We use a state-of-the-art machine learning classifier based on an ensemble learning technique and achieve an overall accuracy of 73%. To the best of our knowledge, this is the first experimental study to analyze and predict 'closed' questions on Stack Overflow.

Submitted 27 July, 2013; originally announced July 2013.

Comments: 13 pages, 14 figures, 10 tables, version 1.0

ACM Class: H.3.3; H.3.4; H.3.5

arXiv:1301.4916 [pdf, other]

Solutions to Detect and Analyze Online Radicalization : A Survey

Authors: Denzil Correa, Ashish Sureka

Abstract: Online Radicalization (also called Cyber-Terrorism or Extremism or Cyber-Racism or Cyber-Hate) is widespread and has become a major and growing concern to society, governments and law enforcement agencies around the world. Research shows that various platforms on the Internet (low barrier to publish content, allows anonymity, provides exposure to millions of users and a potential of a very quick and widespread diffusion of message) such as YouTube (a popular video sharing website), Twitter (an online micro-blogging service), Facebook (a popular social networking website), online discussion forums and blogosphere are being misused for malicious intent. Such platforms are being used to form hate groups, racist communities, spread extremist agenda, incite anger or violence, promote radicalization, recruit members and create virtual organizations and communities. Automatic detection of online radicalization is a technically challenging problem because of the vast amount of the data, unstructured and noisy user-generated content, dynamically changing content and adversary behavior. There are several solutions proposed in the literature aiming to combat and counter cyber-hate and cyber-extremism. In this survey, we review solutions to detect and analyze online radicalization. We review 40 papers published at 12 venues from June 2003 to November 2011. We present a novel classification scheme to classify these papers. We analyze these techniques, perform trend analysis, discuss limitations of existing techniques and find out research gaps.

Submitted 21 January, 2013; originally announced January 2013.

ACM Class: A.1

arXiv:0911.3842 [pdf, other]

doi 10.1088/1367-2630/12/5/053030

Musical Genres: Beating to the Rhythms of Different Drums

Authors: Debora C. Correa, Jose H. Saito, Luciano da F. Costa

Abstract: Online music databases have increased significantly as a consequence of the rapid growth of the Internet and digital audio, requiring the development of faster and more efficient tools for music content analysis. Musical genres are widely used to organize music collections. In this paper, the problem of automatic music genre classification is addressed by exploring rhythm-based features obtained from a respective complex network representation. A Markov model is built in order to analyse the temporal sequence of rhythmic notation events. Feature analysis is performed by using two multivariate statistical approaches: principal component analysis (unsupervised) and linear discriminant analysis (supervised). Similarly, two classifiers are applied in order to identify the category of rhythms: parametric Bayesian classifier under Gaussian hypothesis (supervised), and agglomerative hierarchical clustering (unsupervised). Qualitative results obtained by Kappa coefficient and the obtained clusters corroborated the effectiveness of the proposed method.

Submitted 19 November, 2009; originally announced November 2009.

Comments: 35 pages, 13 figures, 13 tables
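To make the idea of a Markov model over rhythmic notation events concrete, the following sketch estimates first-order transition probabilities from a sequence of event labels. It is an illustration only and not the authors' implementation: the function name `transition_probabilities`, the event labels, and the choice of a first-order model are assumptions.

```python
from collections import Counter, defaultdict

def transition_probabilities(events):
    """Estimate first-order transition probabilities from an event sequence.

    Hypothetical helper: counts how often each event is followed by each
    other event, then normalises the counts per preceding event.
    """
    counts = defaultdict(Counter)
    for current, nxt in zip(events, events[1:]):
        counts[current][nxt] += 1
    return {
        state: {nxt: c / sum(followers.values()) for nxt, c in followers.items()}
        for state, followers in counts.items()
    }

# Made-up rhythm sequence, used only to exercise the function.
rhythm = ["quarter", "eighth", "eighth", "quarter", "quarter", "eighth"]
probs = transition_probabilities(rhythm)
```

The resulting table can be read as a weighted directed graph over rhythmic events, which is one plausible way to connect a Markov model to the network representation the abstract mentions.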

Showing 1–11 of 11 results for author: Correa, D