-
Refinement of an Epilepsy Dictionary through Human Annotation of Health-related posts on Instagram
Authors:
Aehong Min,
Xuan Wang,
Rion Brattig Correia,
Jordan Rozum,
Wendy R. Miller,
Luis M. Rocha
Abstract:
We used a dictionary built from biomedical terminology extracted from various sources such as DrugBank, MedDRA, MedlinePlus, TCMGeneDIT, to tag more than 8 million Instagram posts by users who have mentioned an epilepsy-relevant drug at least once, between 2010 and early 2016. A random sample of 1,771 posts with 2,947 term matches was evaluated by human annotators to identify false-positives. OpenAI's GPT series models were compared against human annotation. Frequent terms with a high false-positive rate were removed from the dictionary. Analysis of the estimated false-positive rates of the annotated terms revealed 8 ambiguous terms (plus synonyms) used in Instagram posts, which were removed from the original dictionary. To study the effect of removing those terms, we constructed knowledge networks using the refined and the original dictionaries and performed an eigenvector-centrality analysis on both networks. We show that the refined dictionary thus produced leads to a significantly different rank of important terms, as measured by their eigenvector-centrality of the knowledge networks. Furthermore, the most important terms obtained after refinement are of greater medical relevance. In addition, we show that OpenAI's GPT series models fare worse than human annotators in this task.
Submitted 14 May, 2024;
originally announced May 2024.
-
Selecting focused digital cohorts from social media using the metric backbone of biomedical knowledge graphs
Authors:
Ziqi Guo,
Jack Felag,
Jordan C. Rozum,
Rion Brattig Correia,
Luis M. Rocha
Abstract:
The abundance of social media data allows researchers to construct large digital cohorts to study the interplay between human behavior and medical treatment. Identifying the users most relevant to a specific health problem is, however, a challenge in that social media sites vary in the generality of their discourse. While X (formerly Twitter), Instagram, and Facebook cater to wide ranging topics, Reddit subgroups and dedicated patient advocacy forums trade in much more specific, biomedically-relevant discourse. To hone in on relevant users anywhere, we have developed a general framework and applied it to epilepsy discourse in social media as a test case. We analyzed the text from posts by users who mention epilepsy drugs in the general-purpose social media sites X and Instagram, the epilepsy-focused Reddit subgroup (r/Epilepsy), and the Epilepsy Foundation of America (EFA) forums. We curated a medical terms dictionary and used it to generate a knowledge graph (KG) for each online community. For each KG, we computed the metric backbone--the smallest subgraph that preserves all shortest paths in the network. By comparing the subset of users who contribute to the backbone to the subset who do not, we found that epilepsy-focused social media users contribute to the KG backbone in much higher proportion than do general-purpose social media users. Furthermore, using human annotation of Instagram posts, we demonstrated that users who do not contribute to the backbone are more than twice as likely to use dictionary terms in a manner inconsistent with their biomedical meaning. For biomedical research applications, our backbone-based approach thus has several benefits over simple engagement-based approaches: It can retain low-engagement users who nonetheless contribute meaningful biomedical insights. It can filter out very vocal users who contribute no relevant content.
Submitted 11 May, 2024;
originally announced May 2024.
-
myAURA: Personalized health library for epilepsy management via knowledge graph sparsification and visualization
Authors:
Rion Brattig Correia,
Jordan C. Rozum,
Leonard Cross,
Jack Felag,
Michael Gallant,
Ziqi Guo,
Bruce W. Herr II,
Aehong Min,
Deborah Stungis Rocha,
Xuan Wang,
Katy Börner,
Wendy Miller,
Luis M. Rocha
Abstract:
Objective: We report the development of the patient-centered myAURA application and suite of methods designed to aid epilepsy patients, caregivers, and researchers in making decisions about care and self-management.
Materials and Methods: myAURA rests on the federation of an unprecedented collection of heterogeneous data resources relevant to epilepsy, such as biomedical databases, social media, and electronic health records. A generalizable, open-source methodology was developed to compute a multi-layer knowledge graph linking all this heterogeneous data via the terms of a human-centered biomedical dictionary.
Results: The power of the approach is first exemplified in the study of the drug-drug interaction phenomenon. Furthermore, we employ a novel network sparsification methodology using the metric backbone of weighted graphs, which reveals the most important edges for inference, recommendation, and visualization, such as pharmacology factors patients discuss on social media. The network sparsification approach also allows us to extract focused digital cohorts from social media whose discourse is more relevant to epilepsy or other biomedical problems. Finally, we present our patient-centered design and pilot-testing of myAURA, including its user interface, based on focus groups and other stakeholder input.
Discussion: The ability to search and explore myAURA's heterogeneous data sources via a sparsified multi-layer knowledge graph, as well as the combination of those layers in a single map, are useful features for integrating relevant information for epilepsy.
Conclusion: Our stakeholder-driven, scalable approach to integrate traditional and non-traditional data sources, enables biomedical discovery and data-powered patient self-management in epilepsy, and is generalizable to other chronic conditions.
Submitted 10 May, 2024; v1 submitted 8 May, 2024;
originally announced May 2024.
-
The ultrametric backbone is the union of all minimum spanning forests
Authors:
Jordan C Rozum,
Luis M Rocha
Abstract:
Minimum spanning trees and forests are powerful sparsification techniques that remove cycles from weighted graphs to minimize total edge weight while preserving node connectivity. They have applications in computer science, network science, and graph theory. Despite their utility and ubiquity, they have several limitations, including that they are only defined for undirected networks, they significantly alter dynamics on networks, and they do not generally preserve important network features such as shortest distances, shortest path distribution, and community structure. In contrast, distance backbones, which are subgraphs formed by all edges that obey a generalized triangle inequality, are well defined in both directed and undirected graphs and preserve those and other important network features. The backbone of a graph is defined with respect to a specified path-length operator that aggregates weights along a path to define its length, thereby associating a cost to indirect connections. The backbone is the union of all shortest paths between each pair of nodes according to the specified operator. One such operator, the max function, computes the length of a path as the largest weight of the edges that compose it (a weakest link criterion). It is the only operator that yields an algebraic structure for computing shortest paths that is consistent with De Morgan's laws. Applying this operator yields the ultrametric backbone of a graph in that (semi-triangular) edges whose weights are larger than the length of an indirect path connecting the same nodes (i.e., those that break the generalized triangle inequality based on max as a path-length operator) are removed. We show that the ultrametric backbone is the union of all minimum spanning forests in undirected graphs and provides a new generalization of minimum spanning trees to directed graphs.
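The edge-removal rule described in this abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `ultrametric_backbone`, the dictionary-based graph representation, and the Floyd-Warshall-style loop are assumptions chosen for brevity, and the graph is assumed to be undirected with non-negative weights and no self-loops.

```python
from itertools import product


def ultrametric_backbone(nodes, edges):
    """Return the edges that survive the max-based triangle inequality.

    nodes: iterable of hashable node labels.
    edges: dict mapping (u, v) tuples to weights, read as undirected.
    Both names are hypothetical, chosen for this illustration only.
    """
    nodes = list(nodes)
    inf = float("inf")
    # length[u][v]: smallest achievable "largest edge weight" over all u-v paths.
    length = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        length[u][v] = min(length[u][v], w)
        length[v][u] = min(length[v][u], w)
    # Floyd-Warshall with max as the path-length operator (k is the outer loop).
    for k, i, j in product(nodes, repeat=3):
        via_k = max(length[i][k], length[k][j])
        if via_k < length[i][j]:
            length[i][j] = via_k
    # An edge is dropped only if some indirect path is strictly shorter.
    return {e: w for e, w in edges.items() if w <= length[e[0]][e[1]]}


triangle = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 3}
backbone = ultrametric_backbone("abc", triangle)
```

In the triangle above, the indirect path a-b-c has length max(1, 2) = 2, which is smaller than the direct weight 3, so edge (a, c) is removed. The two remaining edges are also the unique minimum spanning tree of the triangle, consistent with the result stated in the abstract.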
Submitted 22 March, 2024; v1 submitted 19 March, 2024;
originally announced March 2024.
-
Dynamic Q-planning for Online UAV Path Planning in Unknown and Complex Environments
Authors:
Lidia Gianne Souza da Rocha,
Kenny Anderson Queiroz Caldas,
Marco Henrique Terra,
Fabio Ramos,
Kelen Cristiane Teixeira Vivaldini
Abstract:
Unmanned Aerial Vehicles need an online path planning capability to move in high-risk missions in unknown and complex environments to complete them safely. However, many algorithms reported in the literature may not return reliable trajectories to solve online problems in these scenarios. The Q-Learning algorithm, a Reinforcement Learning Technique, can generate trajectories in real-time and has demonstrated fast and reliable results. This technique, however, has the disadvantage of defining the iteration number. If this value is not well defined, it will take a long time or not return an optimal trajectory. Therefore, we propose a method to dynamically choose the number of iterations to obtain the best performance of Q-Learning. The proposed method is compared to the Q-Learning algorithm with a fixed number of iterations, A*, Rapid-Exploring Random Tree, and Particle Swarm Optimization. As a result, the proposed Q-learning algorithm demonstrates the efficacy and reliability of online path planning with a dynamic number of iterations to carry out online missions in unknown and complex environments.
Submitted 9 February, 2024;
originally announced February 2024.
-
Semi-metric topology characterizes epidemic spreading on complex networks
Authors:
David Soriano Paños,
Felipe Xavier Costa,
Luis M. Rocha
Abstract:
Network sparsification represents an essential tool to extract the core of interactions sustaining both networks dynamics and their connectedness. In the case of infectious diseases, network sparsification methods remove irrelevant connections to unveil the primary subgraph driving the unfolding of epidemic outbreaks in real networks. In this paper, we explore the features determining whether the metric backbone, a subgraph capturing the structure of shortest paths across a network, allows reconstructing epidemic outbreaks. We find that both the relative size of the metric backbone, capturing the fraction of edges kept in such structure, and the distortion of semi-metric edges, quantifying how far those edges not included in the metric backbone are from their associated shortest path, shape the retrieval of Susceptible-Infected (SI) dynamics. We propose a new method to progressively dismantle networks relying on the semi-metric edge distortion, removing first those connections farther from those included in the metric backbone, i.e. those with highest semi-metric distortion values. We apply our method in both synthetic and real networks, finding that semi-metric distortion provides solid ground to preserve spreading dynamics and connectedness while sparsifying networks.
Submitted 24 November, 2023;
originally announced November 2023.
-
Offline Metrics for Evaluating Explanation Goals in Recommender Systems
Authors:
André Levi Zanon,
Marcelo Garcia Manzato,
Leonardo Rocha
Abstract:
Explanations are crucial for improving users' transparency, persuasiveness, engagement, and trust in Recommender Systems (RSs). However, evaluating the effectiveness of explanation algorithms regarding those goals remains challenging due to existing offline metrics' limitations. This paper introduces new metrics for the evaluation and validation of explanation algorithms based on the items and properties used to form the sentence of an explanation. Towards validating the metrics, the results of three state-of-the-art post-hoc explanation algorithms were evaluated for six RSs, comparing the offline metrics results with those of an online user study. The findings show the proposed offline metrics can effectively measure the performance of explanation algorithms and highlight a trade-off between the goals of transparency and trust, which are related to popular properties, and the goals of engagement and persuasiveness, which are associated with the diversification of properties displayed to users. Furthermore, the study contributes to the development of more robust evaluation methods for explanation algorithms in RSs.
Submitted 22 October, 2023;
originally announced October 2023.
-
TPDR: A Novel Two-Step Transformer-based Product and Class Description Match and Retrieval Method
Authors:
Washington Cunha,
Celso França,
Leonardo Rocha,
Marcos André Gonçalves
Abstract:
There is a niche of companies responsible for intermediating the purchase of large batches of varied products for other companies, for which the main challenge is to perform product description standardization, i.e., matching an item described by a client with a product described in a catalog. The problem is complex since the client's product description may be: (1) potentially noisy; (2) short and uninformative (e.g., missing information about model and size); and (3) cross-language. In this paper, we formalize this problem as a ranking task: given an initial client product specification (query), return the most appropriate standardized descriptions (response). In this paper, we propose TPDR, a two-step Transformer-based Product and Class Description Retrieval method that is able to explore the semantic correspondence between IS and SD, by exploiting attention mechanisms and contrastive learning. First, TPDR employs the transformers as two encoders sharing the embedding vector space: one for encoding the IS and another for the SD, in which corresponding pairs (IS, SD) must be close in the vector space. Closeness is further enforced by a contrastive learning mechanism leveraging a specialized loss function. TPDR also exploits a (second) re-ranking step based on syntactic features that are very important for the exact matching (model, dimension) of certain products that may have been neglected by the transformers. To evaluate our proposal, we consider 11 datasets from a real company, covering different application contexts. Our solution was able to retrieve the correct standardized product before the 5th ranking position in 71% of the cases and its correct category in the first position in 80% of the situations. Moreover, the effectiveness gains over purely syntactic or semantic baselines reach up to 3.7 times, solving cases that none of the approaches in isolation can do by themselves.
Submitted 5 October, 2023;
originally announced October 2023.
-
Fast but multi-partisan: Bursts of communication increase opinion diversity in the temporal Deffuant model
Authors:
Fatemeh Zarei,
Yerali Gandica,
Luis Enrique Correa Rocha
Abstract:
Human interactions create social networks forming the backbone of societies. Individuals adjust their opinions by exchanging information through social interactions. Two recurrent questions are whether social structures promote opinion polarisation or consensus in societies and whether polarisation can be avoided, particularly on social media. In this paper, we hypothesise that not only network structure but also the timings of social interactions regulate the emergence of opinion clusters. We devise a temporal version of the Deffuant opinion model where pairwise interactions follow temporal patterns and show that burstiness alone is sufficient to refrain from consensus and polarisation by promoting the reinforcement of local opinions. Individuals self-organise into a multi-partisan society due to network clustering, but the diversity of opinion clusters further increases with burstiness, particularly when individuals have low tolerance and prefer to adjust to similar peers. The emergent opinion landscape is well-balanced regarding clusters' size, with a small fraction of individuals converging to extreme opinions. We thus argue that polarisation is more likely to emerge in social media than offline social networks because of the relatively low social clustering observed online. Counter-intuitively, strengthening online social networks by increasing social redundancy may be a venue to reduce polarisation and promote opinion diversity.
Submitted 28 July, 2023;
originally announced July 2023.
-
Dynamical Modularity in Automata Models of Biochemical Networks
Authors:
Thomas Parmer,
Luis M. Rocha
Abstract:
Given the large size and complexity of most biochemical regulation and signaling networks, there is a non-trivial relationship between the micro-level logic of component interactions and the observed macro-dynamics. Here we address this issue by formalizing the existing concept of pathway modules, which are sequences of state updates that are guaranteed to occur (barring outside interference) in the dynamics of automata networks after the perturbation of a subset of driver nodes. We present a novel algorithm to automatically extract pathway modules from networks and we characterize the interactions that may take place between modules. This methodology uses only the causal logic of individual node variables (micro-dynamics) without the need to compute the dynamical landscape of the networks (macro-dynamics). Specifically, we identify complex modules, which maximize pathway length and require synergy between their components. This allows us to propose a new take on dynamical modularity that partitions complex networks into causal pathways of variables that are guaranteed to transition to specific states given a perturbation to a set of driver nodes. Thus, the same node variable can take part in distinct modules depending on the state it takes. Our measure of dynamical modularity of a network is then inversely proportional to the overlap among complex modules and maximal when complex modules are completely decouplable from one another in the network dynamics. We estimate dynamical modularity for several genetic regulatory networks, including the Drosophila melanogaster segment-polarity network. We discuss how identifying complex modules and the dynamical modularity portrait of networks explains the macro-dynamics of biological networks, such as uncovering the (more or less) decouplable building blocks of emergent computation (or collective behavior) in biochemical regulation and signaling.
Submitted 17 April, 2023; v1 submitted 28 March, 2023;
originally announced March 2023.
-
Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information
Authors:
Maria Clara Ramos Morales Crespo,
Maria Lina de Souza Jeannine Rocha,
Mariana Lourenço Sturzeneker,
Felipe Ribas Serras,
Guilherme Lamartine de Mello,
Aline Silva Costa,
Mayara Feliciano Palma,
Renata Morais Mesquita,
Raquel de Paula Guets,
Mariana Marques da Silva,
Marcelo Finger,
Maria Clara Paixão de Sousa,
Cristiane Namiuti,
Vanessa Martins do Monte
Abstract:
This paper presents the first publicly available version of the Carolina Corpus and discusses its future directions. Carolina is a large open corpus of Brazilian Portuguese texts under construction using web-as-corpus methodology enhanced with provenance, typology, versioning, and text integrality. The corpus aims at being used both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models, contributing towards removing Portuguese from the set of low-resource languages. Here we present the construction of the corpus methodology, comparing it with other existing methodologies, as well as the corpus current state: Carolina's first public version has $653,322,577$ tokens, distributed over $7$ broad types. Each text is annotated with several different metadata categories in its header, which we developed using TEI annotation standards. We also present ongoing derivative works and invite NLP researchers to contribute with their own.
Submitted 28 March, 2023;
originally announced March 2023.
-
The distance backbone of directed networks
Authors:
Felipe Xavier Costa,
Rion Brattig Correia,
Luis M. Rocha
Abstract:
In weighted graphs the shortest path between two nodes is often reached through an indirect path, out of all possible connections, leading to structural redundancies which play key roles in the dynamics and evolution of complex networks. We have previously developed a parameter-free, algebraically-principled methodology to uncover such redundancy and reveal the distance backbone of weighted graphs, which has been shown to be important in transmission dynamics, inference of important paths, and quantifying the robustness of networks. However, the method was developed for undirected graphs. Here we expand this methodology to weighted directed graphs and study the redundancy and robustness found in nine networks ranging from social, biomedical, and technical systems. We found that similarly to undirected graphs, directed graphs in general also contain a large amount of redundancy, as measured by the size of their (directed) distance backbone. Our methodology adds an additional tool to the principled sparsification of complex networks and the measure of their robustness.
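As a rough illustration of the redundancy described in this abstract, the sketch below computes a backbone for a small directed graph using ordinary summed path length (the metric case). It is only a sketch under stated assumptions, not the authors' implementation: the name `metric_backbone_directed`, the arc dictionary format, and the use of Floyd-Warshall are all hypothetical choices, and weights are assumed to be positive distances with no self-loops.

```python
def metric_backbone_directed(nodes, arcs):
    """Keep the arcs that are themselves a shortest path from source to target.

    nodes: iterable of hashable node labels.
    arcs: dict mapping (source, target) tuples to positive distances.
    Both names are hypothetical, chosen for this illustration only.
    """
    nodes = list(nodes)
    inf = float("inf")
    dist = {u: {v: (0 if u == v else inf) for v in nodes} for u in nodes}
    for (u, v), w in arcs.items():
        dist[u][v] = min(dist[u][v], w)  # direction matters: only u -> v is set
    # Standard Floyd-Warshall all-pairs shortest paths (sum of arc weights).
    for k in nodes:
        for i in nodes:
            for j in nodes:
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    # An arc is redundant (semi-metric) if an indirect path is strictly shorter.
    return {a: w for a, w in arcs.items() if w <= dist[a[0]][a[1]]}


arcs = {("a", "b"): 1, ("b", "c"): 1, ("a", "c"): 3, ("c", "a"): 1}
directed_backbone = metric_backbone_directed("abc", arcs)
```

Here the arc a -> c (distance 3) is redundant because the indirect route a -> b -> c has total distance 2, while c -> a is kept because direction is respected and no shorter route from c to a exists.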
Submitted 2 September, 2022;
originally announced September 2022.
-
LargeNetVis: Visual Exploration of Large Temporal Networks Based on Community Taxonomies
Authors:
Claudio D. G. Linhares,
Jean R. Ponciano,
Diogenes S. Pedro,
Luis E. C. Rocha,
Agma J. M. Traina,
Jorge Poco
Abstract:
Temporal (or time-evolving) networks are commonly used to model complex systems and the evolution of their components throughout time. Although these networks can be analyzed by different means, visual analytics stands out as an effective way for a pre-analysis before doing quantitative/statistical analyses to identify patterns, anomalies, and other behaviors in the data, thus leading to new insights and better decision-making. However, the large number of nodes, edges, and/or timestamps in many real-world networks may lead to polluted layouts that make the analysis inefficient or even infeasible. In this paper, we propose LargeNetVis, a web-based visual analytics system designed to assist in analyzing small and large temporal networks. It successfully achieves this goal by leveraging three taxonomies focused on network communities to guide the visual exploration process. The system is composed of four interactive visual components: the first (Taxonomy Matrix) presents a summary of the network characteristics, the second (Global View) gives an overview of the network evolution, the third (a node-link diagram) enables community- and node-level structural analysis, and the fourth (a Temporal Activity Map -- TAM) shows the community- and node-level activity under a temporal perspective.
Submitted 8 August, 2022;
originally announced August 2022.
-
Evolution of the public opinion on COVID-19 vaccination in Japan
Authors:
Yuri Nakayama,
Yuka Takedomi,
Towa Suda,
Takeaki Uno,
Takako Hashimoto,
Masashi Toyoda,
Naoki Yoshinaga,
Masaru Kitsuregawa,
Luis E. C. Rocha,
Ryota Kobayashi
Abstract:
Vaccines are promising tools to control the spread of COVID-19. An effective vaccination campaign requires government policies and community engagement, sharing experiences for social support, and voicing concerns to vaccine safety and efficiency. The increasing use of online social platforms allows us to trace large-scale communication and infer public opinion in real-time. We collected more than 100 million vaccine-related tweets posted by 8 million users and used the Latent Dirichlet Allocation model to perform automated topic modeling of tweet texts during the vaccination campaign in Japan. We identified 15 topics grouped into 4 themes on Personal issue, Breaking news, Politics, and Conspiracy and humour. The evolution of the popularity of themes revealed a shift in public opinion, initially sharing the attention over personal issues (individual aspect), collecting information from the news (knowledge acquisition), and government criticisms, towards personal experiences once confidence in the vaccination campaign was established. An interrupted time series regression analysis showed that the Tokyo Olympic Games affected public opinion more than other critical events but not the course of the vaccination. Public opinion on politics was significantly affected by various events, positively shifting the attention in the early stages of the vaccination campaign and negatively later. Tweets about personal issues were mostly retweeted when the vaccination reached the younger population. The associations between the vaccination campaign stages and tweet themes suggest that the public engagement in the social platform contributed to speedup vaccine uptake by reducing anxiety via social learning and support.
Submitted 22 July, 2022;
originally announced July 2022.
-
Neuroimaging Feature Extraction using a Neural Network Classifier for Imaging Genetics
Authors:
Cédric Beaulac,
Sidi Wu,
Erin Gibson,
Michelle F. Miranda,
Jiguo Cao,
Leno Rocha,
Mirza Faisal Beg,
Farouk S. Nathoo
Abstract:
A major issue in the association of genes to neuroimaging phenotypes is the high dimension of both genetic data and neuroimaging data. In this article, we tackle the latter problem with an eye toward developing solutions that are relevant for disease prediction. Supported by a vast literature on the predictive power of neural networks, our proposed solution uses neural networks to extract from neuroimaging data features that are relevant for predicting Alzheimer's Disease (AD) for subsequent relation to genetics. Our neuroimaging-genetic pipeline is comprised of image processing, neuroimaging feature extraction and genetic association steps. We propose a neural network classifier for extracting neuroimaging features that are related with disease and a multivariate Bayesian group sparse regression model for genetic association. We compare the predictive power of these features to expert selected features and take a closer look at the SNPs identified with the new neuroimaging features.
Submitted 8 July, 2022;
originally announced July 2022.
-
Small Cohort of Epilepsy Patients Showed Increased Activity on Facebook before Sudden Unexpected Death
Authors:
Ian B. Wood,
Rion Brattig Correia,
Wendy R. Miller,
Luis M. Rocha
Abstract:
Sudden Unexpected Death in Epilepsy (SUDEP) remains a leading cause of death in people with epilepsy. Despite the constant risk for patients and bereavement to family members, to date the physiological mechanisms of SUDEP remain unknown. Here we explore the potential to identify putative predictive signals of SUDEP from online digital behavioral data using text and sentiment analysis. Specifically, we analyze Facebook timelines of six epilepsy patients deceased due to SUDEP, donated by surviving family members. We find preliminary evidence for behavioral changes detectable by text and sentiment analysis tools. Namely, in the months preceding their SUDEP event patient social media timelines show: i) increase in verbosity; ii) increased use of functional words; and iii) sentiment shifts as measured by different sentiment analysis tools. Combined, these results suggest that social media engagement, as well as its sentiment, may serve as possible early-warning signals for SUDEP in people with epilepsy. While the small sample of patient timelines analyzed in this study prevents generalization, our preliminary investigation demonstrates the potential of social media data as complementary data in larger studies of SUDEP and epilepsy.
Submitted 19 January, 2022;
originally announced January 2022.
-
Developers perception on the severity of test smells: an empirical study
Authors:
Denivan Campos,
Larissa Rocha,
Ivan Machado
Abstract:
Unit testing is an essential component of the software development life-cycle. A developer could easily and quickly catch and fix software faults introduced in the source code by creating and running unit tests. Despite their importance, unit tests are subject to bad design or implementation decisions, the so-called test smells. These might decrease software systems quality from various aspects, making it harder to understand, more complex to maintain, and more prone to errors and bugs. Many studies discuss the likely effects of test smells on test code. However, there is a lack of studies that capture developers perceptions of such issues. This study empirically analyzes how developers perceive the severity of test smells in the test code they develop. Severity refers to the degree to how a test smell may negatively impact the test code. We selected six open-source software projects from GitHub and interviewed their developers to understand whether and how the test smells affected the test code. Although most of the interviewed developers considered the test smells as having a low severity to their code, they indicated that test smells might negatively impact the project, particularly in test code maintainability and evolution. Also, detecting and removing test smells from the test code may be positive for the project.
Submitted 29 July, 2021;
originally announced July 2021.
-
Autonomous Navigation System for a Delivery Drone
Authors:
Victor R. F. Miranda,
Adriano M. C. Rezende,
Thiago L. Rocha,
Héctor Azpúrua,
Luciano C. A. Pimenta,
Gustavo M. Freitas
Abstract:
The use of delivery services is an increasing trend worldwide, further enhanced by the COVID pandemic. In this context, drone delivery systems are of great interest as they may allow for faster and cheaper deliveries. This paper presents a navigation system that makes feasible the delivery of parcels with autonomous drones. The system generates a path between a start and a final point and controls the drone to follow this path based on its localization obtained through GPS, 9DoF IMU, and barometer. In the landing phase, information of poses estimated by a marker (ArUco) detection technique using a camera, ultra-wideband (UWB) devices, and the drone's software estimation are merged by utilizing an Extended Kalman Filter algorithm to improve the landing precision. A vector field-based method controls the drone to follow the desired path smoothly, reducing vibrations or harsh movements that could harm the transported parcel. Real experiments validate the delivery strategy and allow to evaluate the performance of the adopted techniques. Preliminary results state the viability of our proposal for autonomous drone delivery.
Submitted 16 June, 2021;
originally announced June 2021.
-
From Blackboard to the Office: A Look Into How Practitioners Perceive Software Testing Education
Authors:
Luana Martins,
Vinicius Brito,
Daniela Feitosa,
Larissa Rocha,
Heitor Costa,
Ivan Machado
Abstract:
The teaching-learning process may require specific pedagogical approaches to establish a relationship with industry practices. Recently, some studies investigated the educators' perspectives and the undergraduate courses curriculum to identify potential weaknesses and solutions for the software testing teaching process. However, it is still unclear how the practitioners evaluate the acquisition of knowledge about software testing in undergraduate courses. This study carried out an expert survey with 68 newly graduated practitioners to determine what the industry expects from them and what they learned in academia. The yielded results indicated that those practitioners learned at a similar rate as others with a long industry experience. Also, they studied less than half of the 35 software testing topics collected in the survey and took industry-backed extracurricular courses to complement their learning. Additionally, our findings point out a set of implications for future research, as the respondents' learning difficulties (e.g., lack of learning sources) and the gap between academic education and industry expectations (e.g., certifications).
Submitted 11 June, 2021;
originally announced June 2021.
-
Assessing Exception Handling Testing Practices in Open-Source Libraries
Authors:
Luan P. Lima,
Lincoln S. Rocha,
Carla I. M. Bezerra,
Matheus Paixao
Abstract:
Modern programming languages (e.g., Java and C#) provide features to separate error-handling code from regular code, seeking to enhance software comprehensibility and maintainability. Nevertheless, the way exception handling (EH) code is structured in such languages may lead to multiple, different, and complex control flows, which may affect the software testability. Previous studies have reported that EH code is typically neglected, not well tested, and its misuse can lead to reliability degradation and catastrophic failures. However, little is known about the relationship between testing practices and EH testing effectiveness. In this exploratory study, we (i) measured the adequacy degree of EH testing concerning code coverage (instruction, branch, and method) criteria; and (ii) evaluated the effectiveness of the EH testing by measuring its capability to detect artificially injected faults (i.e., mutants) using 7 EH mutation operators. Our study was performed using test suites of 27 long-lived Java libraries from open-source ecosystems. Our results show that instructions and branches within $\mathtt{catch}$ blocks and $\mathtt{throw}$ instructions are less covered, with statistical significance than the overall instructions and branches. Nevertheless, most of the studied libraries presented test suites capable of detecting more than 70% of the injected faults. From a total of 12,331 mutants created in this study, the test suites were able to detect 68% of them.
Submitted 2 May, 2021;
originally announced May 2021.
-
The distance backbone of complex networks
Authors:
Tiago Simas,
Rion Brattig Correia,
Luis M. Rocha
Abstract:
Redundancy needs more precise characterization as it is a major factor in the evolution and robustness of networks of multivariate interactions. We investigate the complexity of such interactions by inferring a connection transitivity that includes all possible measures of path length for weighted graphs. The result, without breaking the graph into smaller components, is a distance backbone subgraph sufficient to compute all shortest paths. This is important for understanding the dynamics of spread and communication phenomena in real-world networks. The general methodology we formally derive yields a principled graph reduction technique and provides a finer characterization of the triangular geometry of all edges -- those that contribute to shortest paths and those that do not but are involved in other network phenomena. We demonstrate that the distance backbone is very small in large networks across domains ranging from air traffic to the human brain connectome, revealing that network robustness to attacks and failures seems to stem from surprisingly vast amounts of redundancy.
Submitted 11 May, 2021; v1 submitted 8 March, 2021;
originally announced March 2021.
-
A survey on test practitioners' awareness of test smells
Authors:
Nildo Silva Junior,
Larissa Rocha,
Luana Almeida Martins,
Ivan Machado
Abstract:
Developing test code may be a time-consuming task that usually requires much effort and cost, especially when it is done manually. Besides, during this process, developers and testers are likely to adopt bad design choices, which may lead to the introduction of the so-called test smells in test code. Test smells are bad solutions to either implement or design test code. As the test code with test smells increases in size, these tests might become more complex, and as a consequence, much harder to understand and evolve correctly. Therefore, test smells may have a negative impact on the quality and maintenance of test code and may also harm the whole software testing activities. In this context, this study aims to understand whether test professionals non-intentionally insert test smells. We carried out an expert survey to analyze the usage frequency of a set of test smells. Sixty professionals from different companies participated in the survey. We selected 14 widely studied smells from the literature, which are also implemented in existing test smell detection tools. The yielded results indicate that experienced professionals introduce test smells during their daily programming tasks, even when they are using standardized practices from their companies, and not only for their personal assumptions. Another relevant evidence was that developers' professional experience can not be considered as a root-cause for the insertion of test smells in test code.
Submitted 12 March, 2020;
originally announced March 2020.
-
Mining social media data for biomedical signals and health-related behavior
Authors:
Rion Brattig Correia,
Ian B. Wood,
Johan Bollen,
Luis M. Rocha
Abstract:
Social media data has been increasingly used to study biomedical and health-related phenomena. From cohort level discussions of a condition to planetary level analyses of sentiment, social media has provided scientists with unprecedented amounts of data to study human behavior and response associated with a variety of health conditions and medical treatments. Here we review recent work in mining social media for biomedical, epidemiological, and social phenomena information relevant to the multilevel complexity of human health. We pay particular attention to topics where social media data analysis has shown the most progress, including pharmacovigilance, sentiment analysis especially for mental health, and other areas. We also discuss a variety of innovative uses of social media data for health-related applications and important limitations in social media data access and use.
Submitted 28 January, 2020;
originally announced January 2020.
-
Modelling Opinion Dynamics in the Age of Algorithmic Personalisation
Authors:
Nicola Perra,
Luis E C Rocha
Abstract:
Modern technology has drastically changed the way we interact and consume information. For example, online social platforms allow for seamless communication exchanges at an unprecedented scale. However, we are still bounded by cognitive and temporal constraints. Our attention is limited and extremely valuable. Algorithmic personalisation has become a standard approach to tackle the information overload problem. As a result, the exposure to our friends' opinions and our perception about important issues might be distorted. However, the effects of algorithmic gatekeeping on our hyper-connected society are poorly understood. Here, we devise an opinion dynamics model where individuals are connected through a social network and adopt opinions as a function of the view points they are exposed to. We apply various filtering algorithms that select the opinions shown to users i) at random ii) considering time ordering or iii) their current beliefs. Furthermore, we investigate the interplay between such mechanisms and crucial features of real networks. We found that algorithmic filtering might influence opinions' share and distributions, especially in case information is biased towards the current opinion of each user. These effects are reinforced in networks featuring topological and spatial correlations where echo chambers and polarisation emerge. Conversely, heterogeneity in connectivity patterns reduces such tendency. We consider also a scenario where one opinion, through nudging, is centrally pushed to all users. Interestingly, even minimal nudging is able to change the status quo moving it towards the desired view point. Our findings suggest that simple filtering algorithms might be powerful tools to regulate opinion dynamics taking place on social networks.
Submitted 8 November, 2018;
originally announced November 2018.
-
Visual-Quality-Driven Learning for Underwater Vision Enhancement
Authors:
Walysson Vital Barbosa,
Henrique Grandinetti Barbosa Amaral,
Thiago Lages Rocha,
Erickson Rangel Nascimento
Abstract:
The image processing community has witnessed remarkable advances in enhancing and restoring images. Nevertheless, restoring the visual quality of underwater images remains a great challenge. End-to-end frameworks might fail to enhance the visual quality of underwater images since in several scenarios it is not feasible to provide the ground truth of the scene radiance. In this work, we propose a CNN-based approach that does not require ground truth data since it uses a set of image quality metrics to guide the restoration learning process. The experiments showed that our method improved the visual quality of underwater images preserving their edges and also performed well considering the UCIQE metric.
Submitted 12 September, 2018;
originally announced September 2018.
-
CANA: A python package for quantifying control and canalization in Boolean Networks
Authors:
Rion Brattig Correia,
Alexander J. Gates,
Xuan Wang,
Luis M. Rocha
Abstract:
Logical models offer a simple but powerful means to understand the complex dynamics of biochemical regulation, without the need to estimate kinetic parameters. However, even simple automata components can lead to collective dynamics that are computationally intractable when aggregated into networks. In previous work we demonstrated that automata network models of biochemical regulation are highly canalizing, whereby many variable states and their groupings are redundant (Marques-Pita and Rocha, 2013). The precise charting and measurement of such canalization simplifies these models, making even very large networks amenable to analysis. Moreover, canalization plays an important role in the control, robustness, modularity and criticality of Boolean network dynamics, especially those used to model biochemical regulation (Gates and Rocha, 2016; Gates et al., 2016; Manicka, 2017). Here we describe a new publicly-available Python package that provides the necessary tools to extract, measure, and visualize canalizing redundancy present in Boolean network models. It extracts the pathways most effective in controlling dynamics in these models, including their effective graph and dynamics canalizing map, as well as other tools to uncover minimum sets of control variables.
Submitted 9 May, 2018; v1 submitted 9 March, 2018;
originally announced March 2018.
-
City-wide Analysis of Electronic Health Records Reveals Gender and Age Biases in the Administration of Known Drug-Drug Interactions
Authors:
Rion Brattig Correia,
Luciana P. de Araújo,
Mauro M. Mattos,
Luis M. Rocha
Abstract:
The occurrence of drug-drug-interactions (DDI) from multiple drug dispensations is a serious problem, both for individuals and health-care systems, since patients with complications due to DDI are likely to reenter the system at a costlier level. We present a large-scale longitudinal study (18 months) of the DDI phenomenon at the primary- and secondary-care level using electronic health records (EHR) from the city of Blumenau in Southern Brazil (pop. $\approx 340,000$). We found that 181 distinct drug pairs known to interact were dispensed concomitantly to 12\% of the patients in the city's public health-care system. Further, 4\% of the patients were dispensed drug pairs that are likely to result in major adverse drug reactions (ADR)---with costs estimated to be much larger than previously reported in smaller studies. The large-scale analysis reveals that women have a 60\% increased risk of DDI as compared to men; the increase becomes 90\% when considering only DDI known to lead to major ADR. Furthermore, DDI risk increases substantially with age; patients aged 70-79 years have a 34\% risk of DDI when they are dispensed two or more drugs concomitantly. Interestingly, a statistical null model demonstrates that age- and female-specific risks from increased polypharmacy fail by far to explain the observed DDI risks in those populations, suggesting unknown social or biological causes. We also provide a network visualization of drugs and demographic factors that characterize the DDI phenomenon and demonstrate that accurate DDI prediction can be included in healthcare and public-health management, to reduce DDI-related ADR and costs.
Submitted 2 January, 2020; v1 submitted 9 March, 2018;
originally announced March 2018.
-
The Reachability of Computer Programs
Authors:
Reginaldo I. Silva Filho,
Ricardo L. Azevedo da Rocha,
Camila Leite Silva,
Ricardo H. Gracini Guiraldelli
Abstract:
Would it be possible to explain the emergence of new computational ideas using the computation itself? Would it be feasible to describe the discovery process of new algorithmic solutions using only mathematics? This study is the first effort to analyze the nature of such inquiry from the viewpoint of effort to find a new algorithmic solution to a given problem. We define program reachability as a probability function whose argument is a form of the energetic cost (algorithmic entropy) of the problem.
Submitted 22 August, 2017;
originally announced August 2017.
-
Human Sexual Cycles are Driven by Culture and Match Collective Moods
Authors:
Ian B. Wood,
Pedro Leal Varela,
Johan Bollen,
Luis M. Rocha,
Joana Gonçalves-Sá
Abstract:
It is a long-standing question whether human sexual and reproductive cycles are affected predominantly by biology or culture. The literature is mixed with respect to whether biological or cultural factors best explain the reproduction cycle phenomenon, with biological explanations dominating the argument. The biological hypothesis proposes that human reproductive cycles are an adaptation to the seasonal cycles caused by hemisphere positioning, while the cultural hypothesis proposes that conception dates vary mostly due to cultural factors, such as vacation schedule or religious holidays. However, for many countries, common records used to investigate these hypotheses are incomplete or unavailable, biasing existing analysis towards primarily Christian countries in the Northern Hemisphere. Here we show that interest in sex peaks sharply online during major cultural and religious celebrations, regardless of hemisphere location. This online interest, when shifted by nine months, corresponds to documented human birth cycles, even after adjusting for numerous factors such as language, season, and amount of free time due to holidays. We further show that mood, measured independently on Twitter, contains distinct collective emotions associated with those cultural celebrations, and these collective moods correlate with sex search volume outside of these holidays as well. Our results provide converging evidence that the cyclic sexual and reproductive behavior of human populations is mostly driven by culture and that this interest in sex is associated with specific emotions, characteristic of, but not limited to, major cultural and religious celebrations.
Submitted 27 October, 2017; v1 submitted 12 July, 2017;
originally announced July 2017.
-
Sampling of Temporal Networks: Methods and Biases
Authors:
Luis E C Rocha,
Naoki Masuda,
Petter Holme
Abstract:
Temporal networks have been increasingly used to model a diversity of systems that evolve in time; for example human contact structures over which dynamic processes such as epidemics take place. A fundamental aspect of real-life networks is that they are sampled within temporal and spatial frames. Furthermore, one might wish to subsample networks to reduce their size for better visualization or to perform computationally intensive simulations. The sampling method may affect the network structure and thus caution is necessary to generalize results based on samples. In this paper, we study four sampling strategies applied to a variety of real-life temporal networks. We quantify the biases generated by each sampling strategy on a number of relevant statistics such as link activity, temporal paths and epidemic spread. We find that some biases are common in a variety of networks and statistics, but one strategy, uniform sampling of nodes, shows improved performance in most scenarios. Our results help researchers to better design network data collection protocols and to understand the limitations of sampled temporal network data.
Submitted 7 July, 2017;
originally announced July 2017.
-
Multiple seed structure and disconnected networks in respondent-driven sampling
Authors:
Jens Malmros,
Luis E. C. Rocha
Abstract:
Respondent-driven sampling (RDS) is a link-tracing sampling method that is especially suitable for sampling hidden populations. RDS combines an efficient snowball-type sampling scheme with inferential procedures that yield unbiased population estimates under some assumptions about the sampling procedure and population structure. Several seed individuals are typically used to initiate RDS recruitment. However, standard RDS estimation theory assume that all sampled individuals originate from only one seed. We present an estimator, based on a random walk with teleportation, which accounts for the multiple seed structure of RDS. The new estimator can also be used on populations with disconnected social networks. We numerically evaluate our estimator by simulations on artificial and real networks. Our estimator outperforms previous estimators, especially when the proportion of seeds in the sample is large. We recommend our new estimator to be used in RDS studies, in particular when the number of seeds is large or the social network of the population is disconnected.
Submitted 14 March, 2016;
originally announced March 2016.
-
Monitoring Potential Drug Interactions and Reactions via Network Analysis of Instagram User Timelines
Authors:
Rion Brattig Correia,
Lang Li,
Luis M. Rocha
Abstract:
Much recent research aims to identify evidence for Drug-Drug Interactions (DDI) and Adverse Drug reactions (ADR) from the biomedical scientific literature. In addition to this "Bibliome", the universe of social media provides a very promising source of large-scale data that can help identify DDI and ADR in ways that have not been hitherto possible. Given the large number of users, analysis of social media data may be useful to identify under-reported, population-level pathology associated with DDI, thus further contributing to improvements in population health. Moreover, tapping into this data allows us to infer drug interactions with natural products--including cannabis--which constitute an array of DDI very poorly explored by biomedical research thus far. Our goal is to determine the potential of Instagram for public health monitoring and surveillance for DDI, ADR, and behavioral pathology at large. Using drug, symptom, and natural product dictionaries for identification of the various types of DDI and ADR evidence, we have collected ~7000 timelines. We report on 1) the development of a monitoring tool to easily observe user-level timelines associated with drug and symptom terms of interest, and 2) population-level behavior via the analysis of co-occurrence networks computed from user timelines at three different scales: monthly, weekly, and daily occurrences. Analysis of these networks further reveals 3) drug and symptom direct and indirect associations with greater support in user timelines, as well as 4) clusters of symptoms and drugs revealed by the collective behavior of the observed population. This demonstrates that Instagram contains much drug- and pathology specific data for public health monitoring of DDI and ADR, and that complex network analysis provides an important toolbox to extract health-related associations and their support from large-scale social media data.
Submitted 14 January, 2016; v1 submitted 4 October, 2015;
originally announced October 2015.
-
Temporal and structural heterogeneities emerging in adaptive temporal networks
Authors:
Takaaki Aoki,
Luis E. C. Rocha,
Thilo Gross
Abstract:
We introduce a model of adaptive temporal networks whose evolution is regulated by an interplay between node activity and dynamic exchange of information through links. We study the model by using a master equation approach. Starting from a homogeneous initial configuration, we show that temporal and structural heterogeneities, characteristic of real-world networks, spontaneously emerge. This theoretically tractable model thus contributes to the understanding of the dynamics of human activity and interaction networks.
Submitted 4 April, 2016; v1 submitted 1 October, 2015;
originally announced October 2015.
-
Modularity and the spread of perturbations in complex dynamical systems
Authors:
Artemy Kolchinsky,
Alexander J. Gates,
Luis M. Rocha
Abstract:
We propose a method to decompose dynamical systems based on the idea that modules constrain the spread of perturbations. We find partitions of system variables that maximize 'perturbation modularity', defined as the autocovariance of coarse-grained perturbed trajectories. The measure effectively separates the fast intramodular from the slow intermodular dynamics of perturbation spreading (in this respect, it is a generalization of the 'Markov stability' method of network community detection). Our approach captures variation of modular organization across different system states, time scales, and in response to different kinds of perturbations: aspects of modularity which are all relevant to real-world dynamical systems. It offers a principled alternative to detecting communities in networks of statistical dependencies between system variables (e.g., 'relevance networks' or 'functional networks'). Using coupled logistic maps, we demonstrate that the method uncovers hierarchical modular organization planted in a system's coupling matrix. Additionally, in homogeneously-coupled map lattices, it identifies the presence of self-organized modularity that depends on the initial state, dynamical parameters, and type of perturbations. Our approach offers a powerful tool for exploring the modular organization of complex dynamical systems.
Submitted 23 December, 2015; v1 submitted 14 September, 2015;
originally announced September 2015.
-
Respondent-driven sampling bias induced by clustering and community structure in social networks
Authors:
Luis Enrique Correa Rocha,
Anna Ekeus Thorson,
Renaud Lambiotte,
Fredrik Liljeros
Abstract:
Sampling hidden populations is particularly challenging using standard sampling methods mainly because of the lack of a sampling frame. Respondent-driven sampling (RDS) is an alternative methodology that exploits the social contacts between peers to reach and weight individuals in these hard-to-reach populations. It is a snowball sampling procedure where the weight of the respondents is adjusted for the likelihood of being sampled due to differences in the number of contacts. In RDS, the structure of the social contacts thus defines the sampling process and affects its coverage, for instance by constraining the sampling within a sub-region of the network. In this paper we study the bias induced by network structures such as social triangles, community structure, and heterogeneities in the number of contacts, in the recruitment trees and in the RDS estimator. We simulate different scenarios of network structures and response-rates to study the potential biases one may expect in real settings. We find that the prevalence of the estimated variable is associated with the size of the network community to which the individual belongs. Furthermore, we observe that low-degree nodes may be under-sampled in certain situations if the sample and the network are of similar size. Finally, we also show that low response-rates lead to reasonably accurate average estimates of the prevalence but generate relatively large biases.
Submitted 19 March, 2015;
originally announced March 2015.
-
Computational fact checking from knowledge networks
Authors:
Giovanni Luca Ciampaglia,
Prashant Shiralkar,
Luis M. Rocha,
Johan Bollen,
Filippo Menczer,
Alessandro Flammini
Abstract:
Traditional fact checking by expert journalists cannot keep up with the enormous volume of information that is now generated online. Computational fact checking may significantly enhance our ability to evaluate the veracity of dubious information. Here we show that the complexities of human fact checking can be approximated quite well by finding the shortest path between concept nodes under properly defined semantic proximity metrics on knowledge graphs. Framed as a network problem this approach is feasible with efficient computational techniques. We evaluate this approach by examining tens of thousands of claims related to history, entertainment, geography, and biographical information using a public knowledge graph extracted from Wikipedia. Statements independently known to be true consistently receive higher support via our method than do false ones. These findings represent a significant step toward scalable computational fact-checking methods that may one day mitigate the spread of harmful misinformation.
Submitted 14 January, 2015;
originally announced January 2015.
-
Extraction of Pharmacokinetic Evidence of Drug-drug Interactions from the Literature
Authors:
Artemy Kolchinsky,
Anália Lourenço,
Heng-Yi Wu,
Lang Li,
Luis M. Rocha
Abstract:
Drug-drug interaction (DDI) is a major cause of morbidity and mortality and a subject of intense scientific interest. Biomedical literature mining can aid DDI research by extracting evidence for large numbers of potential interactions from published literature and clinical databases. Though DDI is investigated in domains ranging in scale from intracellular biochemistry to human populations, literature mining has not been used to extract specific types of experimental evidence, which are reported differently for distinct experimental goals. We focus on pharmacokinetic evidence for DDI, essential for identifying causal mechanisms of putative interactions and as input for further pharmacological and pharmaco-epidemiology investigations. We used manually curated corpora of PubMed abstracts and annotated sentences to evaluate the efficacy of literature mining on two tasks: first, identifying PubMed abstracts containing pharmacokinetic evidence of DDIs; second, extracting sentences containing such evidence from abstracts. We implemented a text mining pipeline and evaluated it using several linear classifiers and a variety of feature transforms. The most important textual features in the abstract and sentence classification tasks were analyzed. We also investigated the performance benefits of using features derived from PubMed metadata fields, various publicly available named entity recognizers, and pharmacokinetic dictionaries. Several classifiers performed very well in distinguishing relevant and irrelevant abstracts (reaching F1~=0.93, MCC~=0.74, iAUC~=0.99) and sentences (F1~=0.76, MCC~=0.65, iAUC~=0.83). We found that word bigram features were important for achieving optimal classifier performance and that features derived from Medical Subject Headings (MeSH) terms significantly improved abstract classification. ...
Submitted 18 May, 2015; v1 submitted 1 December, 2014;
originally announced December 2014.
-
Designing a minimalist socially aware robotic agent for the home
Authors:
Matthew R. Francisco,
Ian Wood,
Selma Šabanović,
Luis M. Rocha
Abstract:
We present a minimalist social robot that relies on long timeseries of low resolution data such as mechanical vibration, temperature, lighting, sounds and collisions. Our goal is to develop an experimental system for growing socially situated robotic agents whose behavioral repertoire is subsumed by the social order of the space. To get there we are designing robots that use their simple sensors and motion feedback routines to recognize different classes of human activity and then associate to each class a range of appropriate behaviors. We use the Katie Family of robots, built on the iRobot Create platform, an Arduino Uno, and a Raspberry Pi. We describe its sensor abilities and exploratory tests that allow us to develop hypotheses about what objects (sensor data) correspond to something known and observable by a human subject. We use machine learning methods to classify three social scenarios from over a hundred experiments, demonstrating that it is possible to detect social situations with high accuracy, using the low-resolution sensors from our minimalist robot.
Submitted 26 June, 2014;
originally announced June 2014.
-
Random walk centrality for temporal networks
Authors:
Luis Enrique Correa Rocha,
Naoki Masuda
Abstract:
Nodes can be ranked according to their relative importance within the network. Ranking algorithms based on random walks are particularly useful because they connect topological and diffusive properties of the network. Previous methods based on random walks, as for example the PageRank, have focused on static structures. However, several realistic networks are indeed dynamic, meaning that their structure changes in time. In this paper, we propose a centrality measure for temporal networks based on random walks which we call TempoRank. While in a static network, the stationary density of the random walk is proportional to the degree or the strength of a node, we find that in temporal networks, the stationary density is proportional to the in-strength of the so-called effective network. The stationary density also depends on the sojourn probability q which regulates the tendency of the walker to stay in the node. We apply our method to human interaction networks and show that although it is important for a node to be connected to another node with many random walkers at the right moment (one of the principles of the PageRank), this effect is negligible in practice when the time order of link activation is included.
△ Less
Submitted 22 January, 2014;
originally announced January 2014.
-
Distance Closures on Complex Networks
Authors:
Tiago Simas,
Luis M Rocha
Abstract:
To expand the toolbox available to network science, we study the isomorphism between distance and Fuzzy (proximity or strength) graphs. Distinct transitive closures in Fuzzy graphs lead to closures of their isomorphic distance graphs with widely different structural properties. For instance, the All Pairs Shortest Paths (APSP) problem, based on the Dijkstra algorithm, is equivalent to a metric closure, which is only one of the possible ways to calculate shortest paths. Understanding and mapping this isomorphism is necessary to analyse models of complex networks based on weighted graphs. Any conclusions derived from such models should take into account the distortions imposed on graph topology when converting proximity/strength into distance graphs, to subsequently compute path length and shortest path measures. We characterise the isomorphism using the max-min and Dombi disjunction/conjunction pairs. This allows us to: (1) study alternative distance closures, such as those based on diffusion, metric, and ultra-metric distances; (2) identify the operators closest to the metric closure of distance graphs (the APSP), but which are logically consistent; and (3) propose a simple method to compute alternative distance closures using existing algorithms for the APSP. In particular, we show that a specific diffusion distance is promising for community detection in complex networks, and is based on desirable axioms for logical inference or approximate reasoning on networks; it also provides a simple algebraic means to compute diffusion processes on networks. Based on these results, we argue that choosing different distance closures can lead to different conclusions about indirect associations on network data, as well as the structure of complex networks, and are thus important to consider.
Submitted 16 October, 2014; v1 submitted 9 December, 2013;
originally announced December 2013.
-
Flow Motifs Reveal Limitations of the Static Framework to Represent Human interactions
Authors:
Luis Enrique Correa Rocha,
Vincent D Blondel
Abstract:
Networks are commonly used to define underlying interaction structures where infections, information, or other quantities may spread. Although the standard approach has been to aggregate all links into a static structure, some studies suggest that the time order in which the links are established may alter the dynamics of spreading. In this paper, we study the impact of the time ordering in the limits of flow on various empirical temporal networks. By using a random walk dynamics, we estimate the flow on links and convert the original undirected network (temporal and static) into a directed flow network. We then introduce the concept of flow motifs and quantify the divergence in the representativity of motifs when using the temporal and static frameworks. We find that the regularity of contacts and persistence of vertices (common in email communication and face-to-face interactions) result in little difference in the limits of flow for both frameworks. On the other hand, in the case of communication within a dating site (and of a sexual network), the flow between vertices changes significantly in the temporal framework such that the static approximation poorly represents the structure of contacts. We have also observed that cliques with 3 and 4 vertices containing only low-flow links are more represented than the same cliques with all high-flow links. The representativity of these low-flow cliques is higher in the temporal framework. Our results suggest that the flow between vertices connected in cliques depends on the topological context in which they are placed and on the time sequence in which the links are established. The structure of the clique alone does not completely characterize the potential of flow between the vertices.
Submitted 13 March, 2013;
originally announced March 2013.
-
Canalization and control in automata networks: body segmentation in Drosophila melanogaster
Authors:
Manuel Marques-Pita,
Luis M. Rocha
Abstract:
We present schema redescription as a methodology to characterize canalization in automata networks used to model biochemical regulation and signalling. In our formulation, canalization becomes synonymous with redundancy present in the logic of automata. This results in straightforward measures to quantify canalization in an automaton (micro-level), which is in turn integrated into a highly scalable framework to characterize the collective dynamics of large-scale automata networks (macro-level). This way, our approach provides a method to link micro- to macro-level dynamics -- a crux of complexity. Several new results ensue from this methodology: uncovering of dynamical modularity (modules in the dynamics rather than in the structure of networks), identification of minimal conditions and critical nodes to control the convergence to attractors, simulation of dynamical behaviour from incomplete information about initial conditions, and measures of macro-level canalization and robustness to perturbations. We exemplify our methodology with a well-known model of the intra- and inter-cellular genetic regulation of body segmentation in Drosophila melanogaster. We use this model to show that our analysis does not contradict any previous findings. But we also obtain new knowledge about its behaviour: a better understanding of the size of its wild-type attractor basin (larger than previously thought), the identification of novel minimal conditions and critical nodes that control wild-type behaviour, and the resilience of these to stochastic interventions. Our methodology is applicable to any complex network that can be modelled using automata, but we focus on biochemical regulation and signalling, towards a better understanding of the (decentralized) control that orchestrates cellular activity -- with the ultimate goal of explaining how cells and tissues 'compute'.
Submitted 25 January, 2013; v1 submitted 24 January, 2013;
originally announced January 2013.
-
Evaluation of linear classifiers on articles containing pharmacokinetic evidence of drug-drug interactions
Authors:
Artemy Kolchinsky,
Anália Lourenço,
Lang Li,
Luis M. Rocha
Abstract:
Background. Drug-drug interaction (DDI) is a major cause of morbidity and mortality. [...] Biomedical literature mining can aid DDI research by extracting relevant DDI signals from either the published literature or large clinical databases. However, though drug interaction is an ideal area for translational research, the inclusion of literature mining methodologies in DDI workflows is still very preliminary. One area that can benefit from literature mining is the automatic identification of a large number of potential DDIs, whose pharmacological mechanisms and clinical significance can then be studied via in vitro pharmacology and in populo pharmaco-epidemiology. Experiments. We implemented a set of classifiers for identifying published articles relevant to experimental pharmacokinetic DDI evidence. These documents are important for identifying causal mechanisms behind putative drug-drug interactions, an important step in the extraction of large numbers of potential DDIs. We evaluate performance of several linear classifiers on PubMed abstracts, under different feature transformation and dimensionality reduction methods. In addition, we investigate the performance benefits of including various publicly-available named entity recognition features, as well as a set of internally-developed pharmacokinetic dictionaries. Results. We found that several classifiers performed well in distinguishing relevant and irrelevant abstracts. We found that the combination of unigram and bigram textual features gave better performance than unigram features alone, and also that normalization transforms that adjusted for feature frequency and document length improved classification. For some classifiers, such as linear discriminant analysis (LDA), proper dimensionality reduction had a large impact on performance. Finally, the inclusion of NER features and dictionaries was found not to help classification.
Submitted 2 October, 2012;
originally announced October 2012.
-
Semi-metric networks for recommender systems
Authors:
Tiago Simas,
Luis M. Rocha
Abstract:
Weighted graphs obtained from co-occurrence in user-item relations lead to non-metric topologies. We use this semi-metric behavior to issue recommendations, and discuss its relationship to transitive closure on fuzzy graphs. Finally, we test the performance of this method against other item- and user-based recommender systems on the Movielens benchmark. We show that including highly semi-metric edges in our recommendation algorithms leads to better recommendations.
Submitted 8 September, 2012;
originally announced September 2012.
-
Temporal Heterogeneities Increase the Prevalence of Epidemics on Evolving Networks
Authors:
Luis Enrique Correa Rocha,
Vincent D. Blondel
Abstract:
Empirical studies suggest that contact patterns follow heterogeneous inter-event times, meaning that intervals of high activity are followed by periods of inactivity. Combined with birth and death of individuals, these temporal constraints affect the spread of infections in a non-trivial way and are dependent on the particular contact dynamics. We propose a stochastic model to generate temporal networks where vertices make instantaneous contacts following heterogeneous inter-event times, and leave and enter the system at fixed rates. We study how these temporal properties affect the prevalence of an infection and estimate R0, the number of secondary infections, by modeling simulated infections (SIR, SI and SIS) co-evolving with the network structure. We find that heterogeneous contact patterns cause earlier and larger epidemics on the SIR model in comparison to homogeneous scenarios. In the case of SI and SIS, the epidemic is faster in the early stages (up to 90% of prevalence) followed by a slowdown in the asymptotic limit in the case of heterogeneous patterns. In the presence of birth and death, heterogeneous patterns always cause higher prevalence in comparison to homogeneous scenarios with the same average inter-event times. Our results suggest that R0 may be underestimated if temporal heterogeneities are not taken into account in the modeling of epidemics.
Submitted 26 June, 2012;
originally announced June 2012.
-
Prediction and Modularity in Dynamical Systems
Authors:
Artemy Kolchinsky,
Luis M. Rocha
Abstract:
Identifying and understanding modular organizations is centrally important in the study of complex systems. Several approaches to this problem have been advanced, many framed in information-theoretic terms. Our treatment starts from the complementary point of view of statistical modeling and prediction of dynamical systems. It is known that for finite amounts of training data, simpler models can have greater predictive power than more complex ones. We use the trade-off between model simplicity and predictive accuracy to generate optimal multiscale decompositions of dynamical networks into weakly-coupled, simple modules. State-dependent and causal versions of our method are also proposed.
Submitted 16 January, 2015; v1 submitted 19 June, 2011;
originally announced June 2011.
-
A Linear Classifier Based on Entity Recognition Tools and a Statistical Approach to Method Extraction in the Protein-Protein Interaction Literature
Authors:
Anália Lourenço,
Michael Conover,
Andrew Wong,
Azadeh Nematzadeh,
Fengxia Pan,
Hagit Shatkay,
Luis M. Rocha
Abstract:
We participated in the Article Classification and the Interaction Method subtasks (ACT and IMT, respectively) of the Protein-Protein Interaction task of the BioCreative III Challenge. For the ACT, we pursued an extensive testing of available Named Entity Recognition and dictionary tools, and used the most promising ones to extend our Variable Trigonometric Threshold linear classifier. For the IMT, we experimented with a primarily statistical approach, as opposed to employing a deeper natural language processing strategy. Finally, we also studied the benefits of integrating the method extraction approach that we have used for the IMT into the ACT pipeline. For the ACT, our linear article classifier leads to a ranking and classification performance significantly higher than all the reported submissions. For the IMT, our results are comparable to those of other systems, which took very different approaches. For the ACT, we show that the use of named entity recognition tools leads to a substantial improvement in the ranking and classification of articles relevant to protein-protein interaction. Thus, we show that our substantially expanded linear classifier is a very competitive classifier in this domain. Moreover, this classifier produces interpretable surfaces that can be understood as "rules" for human understanding of the classification. In terms of the IMT task, in contrast to other participants, our approach focused on identifying sentences that are likely to bear evidence for the application of a PPI detection method, rather than on classifying a document as relevant to a method. As BioCreative III did not perform an evaluation of the evidence provided by the system, we have conducted a separate assessment; the evaluators agree that our tool is indeed effective in detecting relevant evidence for PPI detection methods.
Submitted 22 April, 2011; v1 submitted 21 March, 2011;
originally announced March 2011.
-
Schema Redescription in Cellular Automata: Revisiting Emergence in Complex Systems
Authors:
Manuel Marques-Pita,
Luis M. Rocha
Abstract:
We present a method to eliminate redundancy in the transition tables of Boolean automata: schema redescription with two symbols. One symbol is used to capture redundancy of individual input variables, and another to capture permutability in sets of input variables: fully characterizing the canalization present in Boolean functions. Two-symbol schemata explain aspects of the behaviour of automata networks that the characterization of their emergent patterns does not capture. We use our method to compare two well-known cellular automata for the density classification task: the human engineered CA GKL, and another obtained via genetic programming (GP). We show that despite having very different collective behaviour, these rules are very similar. Indeed, GKL is a special case of GP. Therefore, we demonstrate that it is more feasible to compare cellular automata via schema redescriptions of their rules, than by looking at their emergent behaviour, leading us to question the tendency in complexity research to pay much more attention to emergent patterns than to local interactions.
Submitted 9 February, 2011; v1 submitted 8 February, 2011;
originally announced February 2011.
-
Collective Classification of Textual Documents by Guided Self-Organization in T-Cell Cross-Regulation Dynamics
Authors:
Alaa Abi-Haidar,
Luis M. Rocha
Abstract:
We present and study an agent-based model of T-cell cross-regulation in the adaptive immune system, which we apply to binary classification. Our method expands an existing analytical model of T-cell cross-regulation (Carneiro et al. in Immunol Rev 216(1):48-68, 2007) that was used to study the self-organizing dynamics of a single population of T-cells in interaction with an idealized antigen-presenting cell capable of presenting a single antigen. With agent-based modeling we are able to study the self-organizing dynamics of multiple populations of distinct T-cells which interact via antigen-presenting cells that present hundreds of distinct antigens. Moreover, we show that such self-organizing dynamics can be guided to produce an effective binary classification of antigens, which is competitive with existing machine learning methods when applied to biomedical text classification. More specifically, here we test our model on a dataset of publicly available full-text biomedical articles provided by the BioCreative challenge (Krallinger in The BioCreative II.5 challenge overview, p 19, 2009). We study the robustness of our model's parameter configurations, and show that it leads to encouraging results comparable to state-of-the-art classifiers. Our results help us understand both T-cell cross-regulation as a general principle of guided self-organization and its applicability to document classification. Therefore, we show that our bio-inspired algorithm is a promising novel method for biomedical article classification and for binary document classification in general.
Submitted 4 February, 2011;
originally announced February 2011.
-
The meta book and size-dependent properties of written language
Authors:
Sebastian Bernhardsson,
Luis Enrique Correa da Rocha,
Petter Minnhagen
Abstract:
Evidence is given for a systematic text-length dependence of the power-law index gamma of a single book. The estimated gamma values are consistent with a monotonic decrease from 2 to 1 with increasing length of a text. A direct connection to an extended Heaps' law is explored. The infinite book limit is, as a consequence, proposed to be given by gamma = 1 instead of the value gamma = 2 expected if Zipf's law were ubiquitously applicable. In addition, we explore the idea that the systematic text-length dependence can be described by a meta book concept, which is an abstract representation reflecting the word-frequency structure of a text. According to this concept, the word-frequency distribution of a text of a certain length written by a single author has the same characteristics as a text of the same length pulled out from an imaginary complete infinite corpus written by the same author.
Submitted 24 September, 2009;
originally announced September 2009.