-
Comparing E-bike and Conventional Bicycle Use Patterns in a Public Bike Share System: A Case Study of Richmond, VA
Authors:
Yifan Yang,
Elliott Sloate,
Nashid Khadem,
Celeste Chavis,
Vanessa Frias Martinez
Abstract:
The results show that pedelecs are generally associated with longer trip distances, shorter trip times, higher speeds, and lower rates of uphill elevation change. The origin-destination analysis considering the business, mixed use, residential, and other uses shows extremely similar trends, with a large number of trips staying within either business or residential locations or mixed use. The roadway use analysis shows that pedelecs are used farther outside of the city than bikes.
Submitted 28 April, 2024;
originally announced April 2024.
-
Generalizing Roberts' characterization of unit interval graphs
Authors:
Virginia Ardévol Martínez,
Romeo Rizzi,
Abdallah Saffidine,
Florian Sikora,
Stéphane Vialette
Abstract:
For any natural number $d$, a graph $G$ is a (disjoint) $d$-interval graph if it is the intersection graph of (disjoint) $d$-intervals, the union of $d$ (disjoint) intervals on the real line. Two important subclasses of $d$-interval graphs are unit and balanced $d$-interval graphs (where every interval has unit length or all the intervals associated with the same vertex have the same length, respectively). A celebrated result by Roberts gives a simple characterization of unit interval graphs being exactly claw-free interval graphs. Here, we study the generalization of this characterization for $d$-interval graphs. In particular, we prove that for any $d \geq 2$, if $G$ is a $K_{1,2d+1}$-free interval graph, then $G$ is a unit $d$-interval graph. However, somewhat surprisingly, under the same assumptions, $G$ is not always a \emph{disjoint} unit $d$-interval graph. This implies that the class of disjoint unit $d$-interval graphs is strictly included in the class of unit $d$-interval graphs. Finally, we study the relationships between the classes obtained under disjoint and non-disjoint $d$-intervals in the balanced case and show that the classes of disjoint balanced 2-intervals and balanced 2-intervals coincide, but this is no longer true for $d>2$.
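The definition of a $d$-interval graph in the abstract above can be illustrated with a minimal Python sketch. It assumes closed intervals (the abstract does not say whether endpoints count), and the names `d_interval_graph`, `is_unit`, and the example representation `rep` are hypothetical, introduced only for illustration. The example represents the claw $K_{1,3}$ with unit 2-intervals; by Roberts' characterization the claw is not a unit interval graph, so the second interval of the centre vertex is what makes the representation possible.

```python
from itertools import combinations


def d_interval_graph(rep):
    """Return the edge set of the intersection graph of a d-interval representation.

    rep maps each vertex to a list of (left, right) closed intervals.
    Two vertices are adjacent when any of their intervals intersect.
    """
    def touch(a, b):
        # Closed intervals [a0, a1] and [b0, b1] intersect (assumption: closed).
        return a[0] <= b[1] and b[0] <= a[1]

    edges = set()
    for u, v in combinations(sorted(rep), 2):
        if any(touch(a, b) for a in rep[u] for b in rep[v]):
            edges.add((u, v))
    return edges


def is_unit(rep, tol=1e-9):
    """Check that every interval in the representation has length 1."""
    return all(abs((right - left) - 1.0) <= tol
               for intervals in rep.values()
               for left, right in intervals)


# Hypothetical unit 2-interval representation of the claw K_{1,3}:
# the centre "c" meets x and y with its first interval and z with its second.
rep = {
    "c": [(0.0, 1.0), (10.0, 11.0)],
    "x": [(-1.0, 0.0), (20.0, 21.0)],
    "y": [(1.0, 2.0), (30.0, 31.0)],
    "z": [(10.5, 11.5), (40.0, 41.0)],
}

edges = d_interval_graph(rep)
```

Here each vertex uses two pairwise disjoint unit intervals, so the sketch is also an instance of a disjoint unit 2-interval representation; it does not reproduce any construction from the paper.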
Submitted 27 April, 2024;
originally announced April 2024.
-
Computational Complexity of Preferred Subset Repairs on Data-Graphs
Authors:
Nina Pardal,
Santiago Cifuentes,
Edwin Pin,
Maria Vanina Martinez,
Sergio Abriola
Abstract:
Preferences are a pivotal component in practical reasoning, especially in tasks that involve decision-making over different options or courses of action that could be pursued. In this work, we focus on repairing and querying inconsistent knowledge bases in the form of graph databases, which involves finding a way to solve conflicts in the knowledge base and considering answers that are entailed from every possible repair, respectively. Without a priori domain knowledge, all possible repairs are equally preferred. Though that may be adequate for some settings, it seems reasonable to establish and exploit some form of preference order among the potential repairs. We study the problem of computing prioritized repairs over graph databases with data values, using a notion of consistency based on GXPath expressions as integrity constraints. We present several preference criteria based on the standard subset repair semantics, incorporating weights, multisets, and set-based priority levels. We show that it is possible to maintain the same computational complexity as in the case where no preference criterion is available for exploitation. Finally, we explore the complexity of consistent query answering in this setting and obtain tight lower and upper bounds for all the preference criteria introduced.
Submitted 27 May, 2024; v1 submitted 14 February, 2024;
originally announced February 2024.
-
The Distributional Uncertainty of the SHAP score in Explainable Machine Learning
Authors:
Santiago Cifuentes,
Leopoldo Bertossi,
Nina Pardal,
Sergio Abriola,
Maria Vanina Martinez,
Miguel Romero
Abstract:
Attribution scores reflect how important the feature values in an input entity are for the output of a machine learning model. One of the most popular attribution scores is the SHAP score, which is an instantiation of the general Shapley value used in coalition game theory. The definition of this score relies on a probability distribution on the entity population. Since the exact distribution is generally unknown, it needs to be assigned subjectively or be estimated from data, which may lead to misleading feature scores. In this paper, we propose a principled framework for reasoning on SHAP scores under unknown entity population distributions. In our framework, we consider an uncertainty region that contains the potential distributions, and the SHAP score of a feature becomes a function defined over this region. We study the basic problems of finding maxima and minima of this function, which allows us to determine tight ranges for the SHAP scores of all features. In particular, we pinpoint the complexity of these problems, and other related ones, showing them to be NP-complete. Finally, we present experiments on a real-world dataset, showing that our framework may contribute to a more robust feature scoring.
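The dependence of the SHAP score on the assumed population distribution, described in the abstract above, can be seen on a toy case. The sketch below computes the score by brute force from the Shapley formula, assuming binary features that are independent with marginals `probs` (a product distribution; this is an assumption for illustration, not something the abstract specifies). The names `expected_output`, `shap_score`, and `and_model` are hypothetical.

```python
from itertools import combinations, product
from math import factorial


def expected_output(model, entity, fixed, probs):
    """Expected model output when features in `fixed` keep the entity's values
    and every other feature i is an independent Bernoulli(probs[i])."""
    n = len(entity)
    free = [i for i in range(n) if i not in fixed]
    total = 0.0
    for bits in product([0, 1], repeat=len(free)):
        e = list(entity)
        weight = 1.0
        for i, b in zip(free, bits):
            e[i] = b
            weight *= probs[i] if b == 1 else 1.0 - probs[i]
        total += weight * model(e)
    return total


def shap_score(model, entity, feature, probs):
    """Brute-force Shapley value of `feature` (exponential in the number of features)."""
    n = len(entity)
    others = [i for i in range(n) if i != feature]
    score = 0.0
    for k in range(n):
        coeff = factorial(k) * factorial(n - k - 1) / factorial(n)
        for subset in combinations(others, k):
            s = set(subset)
            gain = (expected_output(model, entity, s | {feature}, probs)
                    - expected_output(model, entity, s, probs))
            score += coeff * gain
    return score


def and_model(e):
    # Toy classifier: outputs 1 only when both features are 1.
    return 1 if e[0] == 1 and e[1] == 1 else 0


entity = (1, 1)
# Two candidate distributions from a hypothetical uncertainty region.
scores_a = [shap_score(and_model, entity, f, (0.9, 0.1)) for f in (0, 1)]
scores_b = [shap_score(and_model, entity, f, (0.1, 0.9)) for f in (0, 1)]
```

Under the first distribution feature 1 receives the larger score, and under the second feature 0 does, so the feature ranking flips inside the region. This is only an illustration of why a range of scores can be more informative than a single value; it is not the authors' algorithm.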
Submitted 23 January, 2024;
originally announced January 2024.
-
Evaluating Neural Language Models as Cognitive Models of Language Acquisition
Authors:
Héctor Javier Vázquez Martínez,
Annika Lea Heuser,
Charles Yang,
Jordan Kodner
Abstract:
The success of neural language models (LMs) on many technological tasks has brought about their potential relevance as scientific theories of language despite some clear differences between LM training and child language acquisition. In this paper we argue that some of the most prominent benchmarks for evaluating the syntactic capacities of LMs may not be sufficiently rigorous. In particular, we show that the template-based benchmarks lack the structural diversity commonly found in the theoretical and psychological studies of language. When trained on small-scale data modeling child language acquisition, the LMs can be readily matched by simple baseline models. We advocate for the use of the readily available, carefully curated datasets that have been evaluated for gradient acceptability by large pools of native speakers and are designed to probe the structural basis of grammar specifically. On one such dataset, the LI-Adger dataset, LMs evaluate sentences in a way inconsistent with human language users. We conclude with suggestions for better connecting LMs with the empirical study of child language acquisition.
Submitted 30 October, 2023;
originally announced October 2023.
-
Recognizing unit multiple intervals is hard
Authors:
Virginia Ardévol Martínez,
Romeo Rizzi,
Florian Sikora,
Stéphane Vialette
Abstract:
Multiple interval graphs are a well-known generalization of interval graphs introduced in the 1970s to deal with situations arising naturally in scheduling and allocation. A $d$-interval is the union of $d$ intervals on the real line, and a graph is a $d$-interval graph if it is the intersection graph of $d$-intervals. In particular, it is a unit $d$-interval graph if it admits a $d$-interval representation where every interval has unit length.
Whereas it has been known for a long time that recognizing 2-interval graphs and other related classes such as 2-track interval graphs is NP-complete, the complexity of recognizing unit 2-interval graphs has remained open. Here, we settle this question by proving that the recognition of unit 2-interval graphs is also NP-complete. Our proof technique uses a completely different approach from the other hardness results for recognizing related classes. Furthermore, we extend the result to unit $d$-interval graphs for any $d\geq 2$, an extension that does not follow directly in graph recognition problems; for example, it took almost 20 years to close the gap between $d=2$ and $d>2$ for the recognition of $d$-track interval graphs. Our result has several implications, including that recognizing $(x, \dots, x)$ $d$-interval graphs and depth $r$ unit 2-interval graphs is NP-complete for every $x\geq 11$ and every $r\geq 4$.
Submitted 21 September, 2023;
originally announced September 2023.
-
Relaxed Agreement Forests
Authors:
Virginia Aardevol Martinez,
Steven Chaplick,
Steven Kelk,
Ruben Meuwese,
Matus Mihalak,
Georgios Stamoulis
Abstract:
There are multiple factors which can cause the phylogenetic inference process to produce two or more conflicting hypotheses of the evolutionary history of a set X of biological entities. That is: phylogenetic trees with the same set of leaf labels X but with distinct topologies. This leads naturally to the goal of quantifying the difference between two such trees T_1 and T_2. Here we introduce the problem of computing a 'maximum relaxed agreement forest' (MRAF) and use this as a proxy for the dissimilarity of T_1 and T_2, which in this article we assume to be unrooted binary phylogenetic trees. MRAF asks for a partition of the leaf labels X into a minimum number of blocks S_1, S_2, ... S_k such that for each i, the subtrees induced in T_1 and T_2 by S_i are isomorphic up to suppression of degree-2 nodes and taking the labels X into account. Unlike the earlier introduced maximum agreement forest (MAF) model, the subtrees induced by the S_i are allowed to overlap. We prove that it is NP-hard to compute MRAF, by reducing from the problem of partitioning a permutation into a minimum number of monotonic subsequences (PIMS). Furthermore, we show that MRAF has a polynomial time O(log n)-approximation algorithm where n=|X| and permits exact algorithms with single-exponential running time. When at least one of the two input trees has a caterpillar topology, we prove that testing whether a MRAF has size at most k can be answered in polynomial time when k is fixed. We also note that on two caterpillars the approximability of MRAF is related to that of PIMS. Finally, we establish a number of bounds on MRAF, compare its behaviour to MAF both in theory and in an experimental setting and discuss a number of open problems.
Submitted 3 September, 2023;
originally announced September 2023.
-
Which Argumentative Aspects of Hate Speech in Social Media can be reliably identified?
Authors:
Damián Furman,
Pablo Torres,
José A. Rodríguez,
Diego Letzen,
Vanina Martínez,
Laura Alonso Alemany
Abstract:
With the increasing diversity of use cases of large language models, a more informative treatment of texts seems necessary. An argumentative analysis could foster a more reasoned usage of chatbots, text completion mechanisms or other applications. However, it is unclear which aspects of argumentation can be reliably identified and integrated in language models. In this paper, we present an empirical assessment of the reliability with which different argumentative aspects can be automatically identified in hate speech in social media. We have enriched the Hateval corpus (Basile et al. 2019) with a manual annotation of some argumentative components, adapted from Wagemans (2016)'s Periodic Table of Arguments. We show that some components can be identified with reasonable reliability. For those that present a high error ratio, we analyze the patterns of disagreement between expert annotators and errors in automatic procedures, and we propose adaptations of those categories that can be more reliably reproduced.
Submitted 5 June, 2023;
originally announced June 2023.
-
Transfer Learning for Fine-grained Classification Using Semi-supervised Learning and Visual Transformers
Authors:
Manuel Lagunas,
Brayan Impata,
Victor Martinez,
Virginia Fernandez,
Christos Georgakis,
Sofia Braun,
Felipe Bertrand
Abstract:
Fine-grained classification is a challenging task that involves identifying subtle differences between objects within the same category. This task is particularly challenging in scenarios where data is scarce. Visual transformers (ViT) have recently emerged as a powerful tool for image classification, due to their ability to learn highly expressive representations of visual data using self-attention mechanisms. In this work, we explore Semi-ViT, a ViT model fine-tuned using semi-supervised learning techniques, suitable for situations where annotated data is lacking. This is particularly common in e-commerce, where images are readily available but labels are noisy, nonexistent, or expensive to obtain. Our results demonstrate that Semi-ViT outperforms traditional convolutional neural networks (CNN) and ViTs, even when fine-tuned with limited annotated data. These findings indicate that Semi-ViTs hold significant promise for applications that require precise and fine-grained classification of visual data.
Submitted 17 May, 2023;
originally announced May 2023.
-
Matrices inducing generalized metric on sequences
Authors:
Eloi Araujo,
Fábio V. Martinez,
Carlos H. A. Higa,
José Soares
Abstract:
Sequence comparison is a basic task to capture similarities and differences between two or more sequences of symbols, with countless applications such as in computational biology. An alignment is a way to compare sequences, where a given scoring function determines the degree of similarity between them. Many scoring functions are obtained from scoring matrices. However, not all scoring matrices induce scoring functions which are distances, since the scoring function is not necessarily a metric. In this work we establish necessary and sufficient conditions for scoring matrices to induce each one of the properties of a metric in weighted edit distances. For a subset of scoring matrices that induce normalized edit distances, we also characterize each class of scoring matrices inducing normalized edit distances. Furthermore, we define an extended edit distance, which takes into account a set of editing operations that transforms one sequence into another regardless of the existence of a usual corresponding alignment to represent them, describing a criterion to find a sequence of edit operations whose weight is minimum. Similarly, we determine the class of scoring matrices that induces extended edit distances for each of the properties of a metric.
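As a rough illustration of how a scoring matrix induces an alignment-based weighted edit distance, the sketch below uses the standard dynamic program over alignments. It is a textbook technique rather than the paper's construction, and the function names, the `gap` symbol, and the unit-cost scoring function are assumptions made for the example.

```python
def weighted_edit_distance(s, t, cost, gap="-"):
    """Minimum total cost of an alignment of s and t.

    cost(a, b) is the scoring-matrix entry for aligning symbol a with symbol b;
    cost(a, gap) scores a deletion and cost(gap, b) an insertion.
    """
    n, m = len(s), len(t)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + cost(s[i - 1], gap)
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + cost(gap, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost(s[i - 1], t[j - 1]),  # match or substitute
                dist[i - 1][j] + cost(s[i - 1], gap),           # delete s[i-1]
                dist[i][j - 1] + cost(gap, t[j - 1]),           # insert t[j-1]
            )
    return dist[n][m]


def unit_cost(a, b):
    # Hypothetical scoring matrix: 0 on the diagonal, 1 everywhere else.
    return 0.0 if a == b else 1.0


d = weighted_edit_distance("kitten", "sitting", unit_cost)
```

With the unit-cost matrix this reduces to the Levenshtein distance, which is a metric. An asymmetric matrix, or one with a nonzero diagonal entry, can break symmetry or the identity property, which is the kind of behaviour the abstract's conditions are meant to rule out.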
Submitted 15 March, 2023;
originally announced March 2023.
-
A multiclass Q-NLP sentiment analysis experiment using DisCoCat
Authors:
Victor Martinez,
Guilhaume Leroy-Meline
Abstract:
Sentiment analysis is a branch of Natural Language Processing (NLP) whose goal is to assign sentiments or emotions to particular sentences or words. Performing this task is particularly useful for companies wishing to take into account customer feedback through chatbots or verbatim. This has been done extensively in the literature using various approaches, ranging from simple models to deep transformer neural networks. In this paper, we will tackle sentiment analysis in the Noisy Intermediate-Scale Quantum (NISQ) era, using the DisCoCat model of language. We will first present the basics of quantum computing and the DisCoCat model. This will enable us to define a general framework to perform NLP tasks on a quantum computer. We will then extend the two-class classification that was performed by Lorenz et al. (2021) to a four-class sentiment analysis experiment on a much larger dataset, showing the scalability of such a framework.
Submitted 7 September, 2022;
originally announced September 2022.
-
Lower bound for constant-size local certification
Authors:
Virgina Ardévol Martínez,
Marco Caoduro,
Laurent Feuilloley,
Jonathan Narboni,
Pegah Pournajafi,
Jean-Florent Raymond
Abstract:
Given a network property or a data structure, a local certification is a labeling that allows one to efficiently check that the property is satisfied, or that the structure is correct. The quality of a certification is measured by the size of its labels: the smaller, the better. This notion plays a central role in self-stabilization, because the size of the certification is a lower bound (and often an upper bound) on the memory needed for silent self-stabilizing construction of distributed data structures. From the point of view of distributed computing in general, it is also a measure of the locality of a property (e.g. properties of the network itself, such as planarity).
When it comes to the size of the certification labels, one can identify three important regimes: the properties for which the optimal size is polynomial in the number of vertices of the graph, the ones that require only polylogarithmic size, and the ones that can be certified with a constant number of bits. The first two regimes are well studied, with several upper and lower bounds, specific techniques, and active research questions. On the other hand, the constant regime has never been really explored, at least on the lower bound side. The main contribution of this paper is the first non-trivial lower bound for this low regime. More precisely, we show that by using certification on just one bit (a binary certification), one cannot certify $k$-colorability for $k\geq 3$. To do so, we develop a new technique, based on the notion of score, and both local symmetry arguments and a global parity argument. We hope that this technique will be useful for establishing stronger results. We complement this result by a discussion of the implication of lower bounds for this constant-size regime, and with an upper bound for a related problem, illustrating that in some cases one can do better than the natural upper bound.
Submitted 30 August, 2022;
originally announced August 2022.
-
Parsimonious Argument Annotations for Hate Speech Counter-narratives
Authors:
Damian A. Furman,
Pablo Torres,
Jose A. Rodriguez,
Lautaro Martinez,
Laura Alonso Alemany,
Diego Letzen,
Maria Vanina Martinez
Abstract:
We present an enrichment of the Hateval corpus of hate speech tweets (Basile et al. 2019) aimed at facilitating automated counter-narrative generation. As in previous work (Chung et al. 2019), manually written counter-narratives are associated with tweets. However, this information alone seems insufficient to obtain satisfactory language models for counter-narrative generation. That is why we have also annotated tweets with argumentative information based on Wagemans (2016), which we believe can help in building convincing and effective counter-narratives for hate speech against particular groups.
We discuss adequacies and difficulties of this annotation process and present several baselines for automatic detection of the annotated elements. Preliminary results show that automatic annotators perform close to human annotators in detecting some aspects of argumentation, while others only reach low or moderate levels of inter-annotator agreement.
Submitted 1 August, 2022;
originally announced August 2022.
-
Adaptive Repetitions Strategies in IEEE 802.11bd
Authors:
Wu Zhuofei,
Stefania Bartoletti,
Vincent Martinez,
Alessandro Bazzi
Abstract:
A new backward compatible WiFi amendment is under development by the IEEE bd Task Group towards the so-called IEEE 802.11bd, which includes the possibility to transmit up to three repetitions of the same packet. This feature increases time diversity and enables the use of maximum ratio combining (MRC) at the receiver to improve the probability of correct decoding. In this work, we first investigate the packet repetition feature and analyze how it loses its efficacy as traffic increases, since a higher number of transmissions may increase the channel load and collision probability. Then, we propose two strategies for adaptively selecting the number of transmissions, leveraging an adapted version of the channel busy ratio (CBR), which is measured at the transmitter and is an indicator of the channel load. The proposed strategies are validated through network-level simulations that account for both the acquisition and decoding processes. Results show that the proposed strategies ensure that devices use optimal settings under variable traffic conditions.
Submitted 14 July, 2022;
originally announced July 2022.
-
On the complexity of finding set repairs for data-graphs
Authors:
Sergio Abriola,
Santiago Cifuentes,
María Vanina Martínez,
Nina Pardal,
Edwin Pin
Abstract:
In the deeply interconnected world we live in, pieces of information link domains all around us. As graph databases effectively embrace relationships among data and allow processing and querying these connections efficiently, they are rapidly becoming a popular platform for storage that supports a wide range of domains and applications. As in the relational case, it is expected that data preserves a set of integrity constraints that define the semantic structure of the world it represents. When a database does not satisfy its integrity constraints, a possible approach is to search for a 'similar' database that does satisfy the constraints, also known as a repair. In this work, we study the problem of computing subset and superset repairs for graph databases with data values using a notion of consistency based on a set of Reg-GXPath expressions as integrity constraints. We show that for positive fragments of Reg-GXPath these problems admit a polynomial-time algorithm, while the full expressive power of the language renders them intractable.
Submitted 15 June, 2022;
originally announced June 2022.
-
A Methodology for Abstracting the Physical Layer of Direct V2X Communications Technologies
Authors:
Wu Zhuofei,
Stefania Bartoletti,
Vincent Martinez,
Alessandro Bazzi
Abstract:
Recent advancements in V2X communications have greatly increased the flexibility of the physical and medium access control (MAC) layers. This increases the complexity when investigating the system from a network perspective to evaluate the performance of the supported applications. Such flexibility needs in fact to be taken into account through a cross-layer approach, which might lead to challenging evaluation processes. As an accurate simulation of the signals appears unfeasible, a typical solution is to rely on simple models for incorporating the physical layer of the supported technologies, based on off-line measurements or accurate link-level simulations. Such data are, however, limited to a subset of possible configurations, and extending them to others is costly, if not impossible. The goal of this paper is to develop a new approach for modelling the physical layer of vehicle-to-everything (V2X) communications that can be extended to a wide range of configurations without leading to extensive measurement or simulation campaigns at the link layer. In particular, given a scenario and starting from results in terms of packet error rate (PER) vs. signal-to-interference-plus-noise ratio (SINR) related to a subset of possible configurations, we derive one parameter, called implementation loss, that is then used to evaluate the network performance under any configuration in the same scenario. The proposed methodology, leading to a good trade-off among complexity, generality, and accuracy of the performance evaluation process, has been validated through extensive simulations with both IEEE 802.11p and LTE-V2X sidelink technologies in various scenarios.
Submitted 24 October, 2022; v1 submitted 31 March, 2022;
originally announced March 2022.
-
Scalable Query Answering under Uncertainty to Neuroscientific Ontological Knowledge: The NeuroLang Approach
Authors:
Gaston Zanitti,
Yamil Soto,
Valentin Iovene,
Maria Vanina Martinez,
Ricardo Rodriguez,
Gerardo Simari,
Demian Wassermann
Abstract:
Researchers in neuroscience have a growing number of datasets available to study the brain, which is made possible by recent technological advances. Given the extent to which the brain has been studied, there is also available ontological knowledge encoding the current state of the art regarding its different areas, activation patterns, key words associated with studies, etc. Furthermore, there is an inherent uncertainty associated with brain scans arising from the mapping between voxels (3D pixels) and actual points in different individual brains. Unfortunately, there is currently no unifying framework for accessing such collections of rich heterogeneous data under uncertainty, making it necessary for researchers to rely on ad hoc tools. In particular, one major weakness of current tools that attempt to address this kind of task is that only very limited propositional query languages have been developed. In this paper, we present NeuroLang, an ontology language with existential rules, probabilistic uncertainty, and built-in mechanisms to guarantee tractable query answering over very large datasets. After presenting the language and its general query answering architecture, we discuss real-world use cases showing how NeuroLang can be applied to practical scenarios for which current tools are inadequate.
Submitted 23 February, 2022;
originally announced February 2022.
-
Performance Analysis of IEEE 802.11p Preamble Insertion in C-V2X Sidelink Signals for Co-Channel Coexistence
Authors:
Alessandro Bazzi,
Stefania Bartoletti,
Alberto Zanella,
Vincent Martinez
Abstract:
Spectrum scarcity is one of the main challenges of future wireless technologies. When looking at vehicle-to-everything (V2X), this is amplified as spectrum sharing could impact road safety and traffic efficiency. It is therefore of particular importance to study solutions that allow the coexistence, in the same geographical area and in the same channels, of what are today the main V2X access technologies, namely IEEE 802.11p and sidelink LTE-V2X Mode 4. In this work, in addition to investigating the impact of the reciprocal interference, which we demonstrate to have a strong impact especially on the first and in congested channel conditions, a mitigation solution is extensively studied, which is based on the insertion of the IEEE 802.11p preamble at the beginning of the LTE-V2X sidelink transmission. The proposal, which is also under discussion within the standardization bodies, requires no modifications to the IEEE 802.11p protocol stack and minor changes to LTE-V2X sidelink. This solution is directly applicable to upcoming IEEE 802.11bd and extendable to NR-V2X sidelink. The paper shows, through analysis and simulations in free-flow and dense scenarios, that the proposal allows for a mitigation of collisions caused by co-channel coexistence under low to high-load channel conditions and that the improvement is also granted in congested cases when combined with additional countermeasures. Regarding the latter aspect, in particular, different approaches are compared, demonstrating that acting on the congestion control mechanisms is a simple but effective solution.
Submitted 6 April, 2022; v1 submitted 18 January, 2022;
originally announced January 2022.
-
An epistemic approach to model uncertainty in data-graphs
Authors:
Sergio Abriola,
Santiago Cifuentes,
María Vanina Martínez,
Nina Pardal,
Edwin Pin
Abstract:
Graph databases are becoming widely successful as data models that make it possible to effectively represent and process complex relationships among various types of data. As with any other type of data repository, graph databases may suffer from errors and discrepancies with respect to the real-world data they intend to represent. In this work we explore the notion of probabilistic unclean graph databases, previously proposed for relational databases, in order to capture the idea that the observed (unclean) graph database is actually the noisy version of a clean one that correctly models the world but that we know only partially. As the factors that may be involved in the observation can be many, e.g., all different types of clerical errors or unintended transformations of the data, we assume a probabilistic model that describes the distribution over all possible ways in which the clean (uncertain) database could have been polluted. Based on this model we define two computational problems, data cleaning and probabilistic query answering, and study the complexity of both when the transformation of the database can be caused by either removing (subset) or adding (superset) nodes and edges.
Submitted 7 January, 2023; v1 submitted 28 September, 2021;
originally announced September 2021.
-
On the Importance of Domain-specific Explanations in AI-based Cybersecurity Systems (Technical Report)
Authors:
Jose N. Paredes,
Juan Carlos L. Teze,
Gerardo I. Simari,
Maria Vanina Martinez
Abstract:
With the availability of large datasets and ever-increasing computing power, there has been a growing use of data-driven artificial intelligence systems, which have shown their potential for successful application in diverse areas. However, many of these systems are not able to provide information about the rationale behind their decisions to their users. Lack of understanding of such decisions can be a major drawback, especially in critical domains such as those related to cybersecurity. In light of this problem, in this paper we make three contributions: (i) proposal and discussion of desiderata for the explanation of outputs generated by AI-based cybersecurity systems; (ii) a comparative analysis of approaches in the literature on Explainable Artificial Intelligence (XAI) under the lens of both our desiderata and further dimensions that are typically used for examining XAI approaches; and (iii) a general architecture that can serve as a roadmap for guiding research efforts towards the development of explainable AI-based cybersecurity systems -- at its core, this roadmap proposes combinations of several research lines in a novel way towards tackling the unique challenges that arise in this context.
Submitted 2 August, 2021;
originally announced August 2021.
-
Algorithms for normalized multiple sequence alignments
Authors:
Eloi Araujo,
Luiz Rozante,
Diego P. Rubert,
Fabio V. Martinez
Abstract:
Sequence alignment supports numerous tasks in bioinformatics, natural language processing, pattern recognition, social sciences, and other fields. While the alignment of two sequences may be performed swiftly in many applications, the simultaneous alignment of multiple sequences has proved to be naturally more intricate. Although most multiple sequence alignment (MSA) formulations are NP-hard, several approaches have been developed, as they can outperform pairwise alignment methods or are necessary for some applications.
Taking into account not only similarities but also the lengths of the compared sequences (i.e. normalization) can provide better alignment results than either unnormalized or post-normalized approaches. While some normalized methods have been developed for pairwise sequence alignment, none have been proposed for MSA. This work is a first effort towards the development of normalized methods for MSA.
We discuss multiple aspects of normalized multiple sequence alignment (NMSA). We define three new criteria for computing normalized scores when aligning multiple sequences, showing the NP-hardness and exact algorithms for solving the NMSA using those criteria. In addition, we provide approximation algorithms for MSA and NMSA for some classes of scoring matrices.
Submitted 3 December, 2021; v1 submitted 4 July, 2021;
originally announced July 2021.
-
pysentimiento: A Python Toolkit for Opinion Mining and Social NLP tasks
Authors:
Juan Manuel Pérez,
Mariela Rajngewerc,
Juan Carlos Giudici,
Damián A. Furman,
Franco Luque,
Laura Alonso Alemany,
María Vanina Martínez
Abstract:
In recent years, the extraction of opinions and information from user-generated text has attracted a lot of interest, largely due to the unprecedented volume of content in Social Media. However, social researchers face some issues in adopting cutting-edge tools for these tasks, as they are usually behind commercial APIs, unavailable for other languages than English, or very complex to use for non-experts. To address these issues, we present pysentimiento, a comprehensive multilingual Python toolkit designed for opinion mining and other Social NLP tasks. This open-source library brings state-of-the-art models for Spanish, English, Italian, and Portuguese in an easy-to-use Python library, allowing researchers to leverage these techniques. We present a comprehensive assessment of performance for several pre-trained language models across a variety of tasks, languages, and datasets, including an evaluation of fairness in the results.
Submitted 25 October, 2023; v1 submitted 17 June, 2021;
originally announced June 2021.
-
Automated Quality Assessment of Cognitive Behavioral Therapy Sessions Through Highly Contextualized Language Representations
Authors:
Nikolaos Flemotomos,
Victor R. Martinez,
Zhuohao Chen,
Torrey A. Creed,
David C. Atkins,
Shrikanth Narayanan
Abstract:
During a psychotherapy session, the counselor typically adopts techniques which are codified along specific dimensions (e.g., 'displays warmth and confidence', or 'attempts to set up collaboration') to facilitate the evaluation of the session. Those constructs, traditionally scored by trained human raters, reflect the complex nature of psychotherapy and highly depend on the context of the interaction. Recent advances in deep contextualized language models offer an avenue for accurate in-domain linguistic representations which can lead to robust recognition and scoring of such psychotherapy-relevant behavioral constructs, and support quality assurance and supervision. In this work, we propose a BERT-based model for automatic behavioral scoring of a specific type of psychotherapy, called Cognitive Behavioral Therapy (CBT), where prior work is limited to frequency-based language features and/or short text excerpts which do not capture the unique elements involved in a spontaneous long conversational interaction. The model focuses on the classification of therapy sessions with respect to the overall score achieved on the widely-used Cognitive Therapy Rating Scale (CTRS), but is trained in a multi-task manner in order to achieve higher interpretability. BERT-based representations are further augmented with available therapy metadata, providing relevant non-linguistic context and leading to consistent performance improvements. We train and evaluate our models on a set of 1,118 real-world therapy sessions, recorded and automatically transcribed. Our best model achieves an F1 score equal to 72.61% on the binary classification task of low vs. high total CTRS.
Submitted 4 October, 2021; v1 submitted 23 February, 2021;
originally announced February 2021.
-
Automated Evaluation Of Psychotherapy Skills Using Speech And Language Technologies
Authors:
Nikolaos Flemotomos,
Victor R. Martinez,
Zhuohao Chen,
Karan Singla,
Victor Ardulov,
Raghuveer Peri,
Derek D. Caperton,
James Gibson,
Michael J. Tanana,
Panayiotis Georgiou,
Jake Van Epps,
Sarah P. Lord,
Tad Hirsch,
Zac E. Imel,
David C. Atkins,
Shrikanth Narayanan
Abstract:
With the growing prevalence of psychological interventions, it is vital to have measures which rate the effectiveness of psychological care to assist in training, supervision, and quality assurance of services. Traditionally, quality assessment is addressed by human raters who evaluate recorded sessions along specific dimensions, often codified through constructs relevant to the approach and domain. This is however a cost-prohibitive and time-consuming method that leads to poor feasibility and limited use in real-world settings. To facilitate this process, we have developed an automated competency rating tool able to process the raw recorded audio of a session, analyzing who spoke when, what they said, and how the health professional used language to provide therapy. Focusing on a use case of a specific type of psychotherapy called Motivational Interviewing, our system gives comprehensive feedback to the therapist, including information about the dynamics of the session (e.g., therapist's vs. client's talking time), low-level psychological language descriptors (e.g., type of questions asked), as well as other high-level behavioral constructs (e.g., the extent to which the therapist understands the clients' perspective). We describe our platform and its performance using a dataset of more than 5,000 recordings drawn from its deployment in a real-world clinical setting used to assist training of new therapists. Widespread use of automated psychotherapy rating tools may augment experts' capabilities by providing an avenue for more effective training and skill improvement, eventually leading to more positive clinical outcomes.
Submitted 27 March, 2021; v1 submitted 22 February, 2021;
originally announced February 2021.
-
Victim or Perpetrator? Analysis of Violent Characters Portrayals from Movie Scripts
Authors:
Victor R Martinez,
Krishna Somandepalli,
Karan Singla,
Anil Ramanakrishna,
Yalda T. Uhls,
Shrikanth Narayanan
Abstract:
Violent content in the media can influence viewers' perception of society. For example, frequent depictions of certain demographics as victims or perpetrators of violence can shape stereotyped attitudes. We propose that computational methods can aid in the large-scale analysis of violence in movies. The method we develop characterizes aspects of violent content solely from the language used in the scripts. Thus, our method is applicable to a movie in the earlier stages of content creation even before it is produced. This is complementary to previous works which rely on audio or video post-production. In this work, we identify stereotypes in character roles (i.e., victim, perpetrator and narrator) based on the demographics of the actor cast for that role. Our results highlight two significant differences in the frequency of portrayals as well as the demographics of the interaction between victims and perpetrators: (1) female characters appear more often as victims, and (2) perpetrators are more likely to be White if the victim is Black or Latino. To date, we are the first to show that language used in movie scripts is a strong indicator of violent content, and that there are systematic portrayals of certain demographics as victims and perpetrators in a large dataset. This offers novel computational tools to assist in creating awareness of representations in storytelling.
Submitted 29 August, 2020; v1 submitted 18 August, 2020;
originally announced August 2020.
-
Natural family-free genomic distance
Authors:
Diego P. Rubert,
Fábio V. Martinez,
Marília D. V. Braga
Abstract:
A classical problem in comparative genomics is to compute the rearrangement distance, that is, the minimum number of large-scale rearrangements required to transform a given genome into another given genome.
While the most traditional approaches in this area are family-based, i.e., require the classification of DNA fragments into families, more recently an alternative family-free approach was proposed, which consists of studying the rearrangement distances without prior family assignment. On the one hand the computation of genomic distances in the family-free setting helps to match occurrences of duplicated genes and find homologies, but on the other hand this computation is NP-hard. In this paper, by letting structural rearrangements be represented by the generic double cut and join (DCJ) operation and also allowing insertions and deletions of DNA segments, we propose a new and more general family-free genomic distance, providing an efficient ILP formulation to solve it.
Our experiments show that the ILP produces accurate results and can handle not only bacterial genomes, but also fungi and insects, or subsets of chromosomes of mammals and plants.
Submitted 14 July, 2020; v1 submitted 7 July, 2020;
originally announced July 2020.
-
On motifs in colored graphs
Authors:
Diego P Rubert,
Eloi Araujo,
Marco A Stefanes,
Jens Stoye,
Fábio V Martinez
Abstract:
One of the most important concepts in biological network analysis is that of network motifs, which are patterns of interconnections that occur in a given network at a frequency higher than expected in a random network. In this work we are interested in searching and inferring network motifs in a class of biological networks that can be represented by vertex-colored graphs. We show the computational complexity for many problems related to colorful topological motifs and present efficient algorithms for special cases. We also present a probabilistic strategy to detect highly frequent motifs in vertex-colored graphs. Experiments on real data sets show that our algorithms are very competitive both in efficiency and in quality of the solutions.
Submitted 27 May, 2020;
originally announced May 2020.
-
Empirical Models for the Realistic Generation of Cooperative Awareness Messages in Vehicular Networks
Authors:
Rafael Molina-Masegosa,
Miguel Sepulcre,
Javier Gozalvez,
Friedbert Berens,
Vincent Martinez
Abstract:
Most V2X (Vehicle-to-Everything) applications rely on broadcasting awareness messages known as CAM (Cooperative Awareness Messages) in ETSI or BSM (Basic Safety Message) in SAE standards. A large number of studies have been devoted to guaranteeing their reliable transmission. However, to date, the studies are generally based on simplified data traffic models that generate awareness messages at periodic intervals or with a constant message size. These models do not accurately represent the real generation of CAM messages that follow specific mobility-based rules. Using simplified and unrealistic traffic models can significantly impact the results and validity of the studies, and hence accurate models for the generation of awareness messages are necessary. This paper proposes the first set of models that can realistically generate CAM messages. The models have been created from real traces collected by two car manufacturers in urban, suburban and highway test drives. The models are based on mth order Markov sources, and model the size of CAMs and the time interval between CAMs. The models are openly provided to the community and can be easily integrated into any simulator.
Submitted 15 April, 2020;
originally announced April 2020.
-
Co-channel Coexistence: Let ITS-G5 and Sidelink C-V2X Make Peace
Authors:
Alessandro Bazzi,
Alberto Zanella,
Ioannis Sarris,
Vincent Martinez
Abstract:
In the last few years, two technologies have been developed to enable direct exchange of information between vehicles. These technologies, currently seen as alternatives, are ITS-G5, as it is commonly referred to in Europe, and sidelink LTE-vehicle-to-everything (LTE-V2X) (one of the solutions of the so-called cellular-V2X, C-V2X). For this reason, the attention has been mostly concentrated on comparing them and remarking their strengths and weaknesses to motivate a choice. In contrast, in this work we focus on a scenario where both are used in the same area and using the same frequency channels, without assistance from any infrastructure. Our results show that under co-channel coexistence the range of ITS-G5 is severely degraded, while impact on LTE-V2X is marginal. Additionally, a mitigation method where the CAM data generation is constrained to periodic intervals is shown to reduce the impact of co-channel coexistence, with less degradation on ITS-G5 performance and even improvement for LTE-V2X.
Submitted 20 March, 2020;
originally announced March 2020.
-
A system for the 2019 Sentiment, Emotion and Cognitive State Task of DARPAs LORELEI project
Authors:
Victor R Martinez,
Anil Ramakrishna,
Ming-Chang Chiu,
Karan Singla,
Shrikanth Narayanan
Abstract:
During the course of a Humanitarian Assistance-Disaster Relief (HADR) crisis, which can happen anywhere in the world, real-time information is often posted online by the people in need of help which, in turn, can be used by different stakeholders involved with management of the crisis. Automated processing of such posts can considerably improve the effectiveness of such efforts; for example, understanding the aggregated emotion from affected populations in specific areas may help inform decision-makers on how to best allocate resources for an effective disaster response. However, these efforts may be severely limited by the availability of resources for the local language. The ongoing DARPA project Low Resource Languages for Emergent Incidents (LORELEI) aims to further language processing technologies for low resource languages in the context of such a humanitarian crisis. In this work, we describe our submission for the 2019 Sentiment, Emotion and Cognitive state (SEC) pilot task of the LORELEI project. We describe a collection of sentiment analysis systems included in our submission along with the features extracted. Our fielded systems obtained the best results in both English and Spanish language evaluations of the SEC pilot task.
Submitted 1 May, 2019;
originally announced May 2019.
-
A group law on the projective plane with applications in Public Key Cryptography
Authors:
R. Durán Díaz,
V. Gayoso Martínez,
L. Hernández Encinas,
J. Muñoz Masqué
Abstract:
We present a new group law defined on a subset of the projective plane $\mathbb{F}P^2$ over an arbitrary field $\mathbb{F}$, which lends itself to applications in Public Key Cryptography, in particular to a Diffie-Hellman-like key agreement protocol. We analyze the computational difficulty of solving the mathematical problem underlying the proposed Abelian group law and we prove that the security of our proposal is equivalent to the discrete logarithm problem in the multiplicative group of the cubic extension of the finite field considered. Finally, we present a variant of the proposed group law but over the ring $\mathbb{Z}/pq\mathbb{Z}$, and explain how the security becomes enhanced, though at the cost of a longer key length.
Submitted 10 June, 2019; v1 submitted 1 February, 2018;
originally announced February 2018.
-
An Automorphic Distance Metric and its Application to Node Embedding for Role Mining
Authors:
Víctor Martínez,
Fernando Berzal,
Juan-Carlos Cubero
Abstract:
Role is a fundamental concept in the analysis of the behavior and function of interacting entities represented by network data. Role discovery is the task of uncovering hidden roles. Node roles are commonly defined in terms of equivalence classes, where two nodes have the same role if they fall within the same equivalence class. Automorphic equivalence, where two nodes are equivalent when they can swap their labels to form an isomorphic graph, captures this common notion of role. The binary concept of equivalence is too restrictive and nodes in real-world networks rarely belong to the same equivalence class. Instead, a relaxed definition in terms of similarity or distance is commonly used to compute the degree to which two nodes are equivalent. In this paper, we propose a novel distance metric called automorphic distance, which measures how far two nodes are from being automorphically equivalent. We also study its application to node embedding, showing how our metric can be used to generate vector representations of nodes preserving their roles for data visualization and machine learning. Our experiments confirm that the proposed metric outperforms the RoleSim automorphic equivalence-based metric in the generation of node embeddings for different networks.
Submitted 5 September, 2018; v1 submitted 19 December, 2017;
originally announced December 2017.
-
The NOESIS Network-Oriented Exploration, Simulation, and Induction System
Authors:
Víctor Martínez,
Fernando Berzal,
Juan-Carlos Cubero
Abstract:
Network data mining has become an important area of study due to the large number of problems it can be applied to. This paper presents NOESIS, an open source framework for network data mining that provides a large collection of network analysis techniques, including the analysis of network structural properties, community detection methods, link scoring, and link prediction, as well as network visualization algorithms. It also features a complete stand-alone graphical user interface that facilitates the use of all these techniques. The NOESIS framework has been designed using solid object-oriented design principles and structured parallel programming. As a lightweight library with minimal external dependencies and a permissive software license, NOESIS can be incorporated into other software projects. Released under a BSD license, it is available from http://noesis.ikor.org.
Submitted 23 June, 2017; v1 submitted 15 November, 2016;
originally announced November 2016.
-
Tweeting Over The Border: An Empirical Study of Transnational Migration in San Diego and Tijuana
Authors:
Victor R. Martinez,
Antonio Mancilla,
Victor M. Gonzalez
Abstract:
Sociological studies on transnational migration are often based on surveys or interviews, an expensive and time-consuming approach. On the other hand, the pervasiveness of mobile phones and location-aware social networks has introduced new ways to understand human mobility patterns at a national or global scale. In this work, we leverage geo-located information obtained from Twitter to understand transnational migration patterns between two border cities (San Diego, USA and Tijuana, Mexico). We obtained 10.9 million geo-located tweets from December 2013 to January 2015. Our method infers human mobility by inspecting tweet submissions and users' home locations. Our results depict a transnational community structure that exhibits the formation of a functional metropolitan area that physically transcends international borders. These results show the potential for re-analysing sociological phenomena from a technology-based empirical perspective.
Submitted 21 July, 2015;
originally announced July 2015.
-
Top-k Query Answering in Datalog+/- Ontologies under Subjective Reports (Technical Report)
Authors:
Thomas Lukasiewicz,
Maria Vanina Martinez,
Cristian Molinaro,
Livia Predoiu,
Gerardo I. Simari
Abstract:
The use of preferences in query answering, both in traditional databases and in ontology-based data access, has recently received much attention, due to its many real-world applications. In this paper, we tackle the problem of top-k query answering in Datalog+/- ontologies subject to the querying user's preferences and a collection of (subjective) reports of other users. Here, each report consists of scores for a list of features, its author's preferences among the features, as well as other information. These pieces of information of every report are then combined, along with the querying user's preferences and his/her trust in each report, to rank the query results. We present two alternative such rankings, along with algorithms for top-k (atomic) query answering under these rankings. We also show that, under suitable assumptions, these algorithms run in polynomial time in the data complexity. We finally present more general reports, which are associated with sets of atoms rather than single atoms.
Submitted 29 November, 2013;
originally announced December 2013.
-
Heuristic Ranking in Tightly Coupled Probabilistic Description Logics
Authors:
Thomas Lukasiewicz,
Maria Vanina Martinez,
Giorgio Orsi,
Gerardo I. Simari
Abstract:
The Semantic Web effort has steadily been gaining traction in recent years. In particular, Web search companies are recently realizing that their products need to evolve towards having richer semantic search capabilities. Description logics (DLs) have been adopted as the formal underpinnings for Semantic Web languages used in describing ontologies. Reasoning under uncertainty has recently taken a leading role in this arena, given the nature of data found on the Web. In this paper, we present a probabilistic extension of the DL EL++ (which underlies the OWL2 EL profile) using Markov logic networks (MLNs) as probabilistic semantics. This extension is tightly coupled, meaning that probabilistic annotations in formulas can refer to objects in the ontology. We show that, even though the tightly coupled nature of our language means that many basic operations are data-intractable, we can leverage a sublanguage of MLNs that allows us to rank the atomic consequences of an ontology relative to their probability values (called ranking queries) even when these values are not fully computed. We present an anytime algorithm to answer ranking queries, and provide an upper bound on the error that it incurs, as well as a criterion to decide when results are guaranteed to be correct.
Submitted 16 October, 2012;
originally announced October 2012.