U Can't Gen This? A Survey of Intellectual Property Protection Methods for Data in Generative AI

Authors: Tanja Šarčević, Alicja Karlowicz, Rudolf Mayer, Ricardo Baeza-Yates, Andreas Rauber

Abstract: Large Generative AI (GAI) models have the unparalleled ability to generate text, images, audio, and other forms of media that are increasingly indistinguishable from human-generated content. As these models often train on publicly available data, including copyrighted materials, art and other creative works, they inadvertently risk violating copyright and misappropriating intellectual property (IP). Due to the rapid development of generative AI technology and pressing ethical considerations from stakeholders, protective mechanisms and techniques are emerging at a high pace but lack systematisation. In this paper, we study the concerns regarding the intellectual property rights of training data and specifically focus on the properties of generative models that enable misuse leading to potential IP violations. Then we propose a taxonomy that leads to a systematic review of technical solutions for safeguarding the data from intellectual property violations in GAI.

Submitted 22 April, 2024; originally announced June 2024.

arXiv:2405.12312 [pdf, other]

A Principled Approach for a New Bias Measure

Authors: Bruno Scarone, Alfredo Viola, Ricardo Baeza-Yates

Abstract: The widespread use of machine learning and data-driven algorithms for decision making has been steadily increasing over many years. The areas in which this is happening are diverse: healthcare, employment, finance, education, and the legal system, to name a few; and the associated negative side effects are increasingly harmful for society. Negative data bias is one of those, which tends to result in harmful consequences for specific groups of people. Any mitigation strategy or effective policy that addresses the negative consequences of bias must start with awareness that bias exists, together with a way to understand and quantify it. However, there is a lack of consensus on how to measure data bias and oftentimes the intended meaning is context dependent and not uniform within the research community. The main contributions of our work are: (1) a general algorithmic framework for defining and efficiently quantifying the bias level of a dataset with respect to a protected group; and (2) the definition of a new bias measure. Our results are experimentally validated using nine publicly available datasets and theoretically analyzed, providing novel insights about the problem. Based on our approach, we also derive a bias mitigation algorithm that might be useful to policymakers.

Submitted 20 May, 2024; originally announced May 2024.

arXiv:2403.00148 [pdf, ps, other]

Implications of Regulations on the Use of AI and Generative AI for Human-Centered Responsible Artificial Intelligence

Authors: Marios Constantinides, Mohammad Tahaei, Daniele Quercia, Simone Stumpf, Michael Madaio, Sean Kennedy, Lauren Wilcox, Jessica Vitak, Henriette Cramer, Edyta Bogucka, Ricardo Baeza-Yates, Ewa Luger, Jess Holbrook, Michael Muller, Ilana Golbin Blumenfeld, Giada Pistilli

Abstract: With the upcoming AI regulations (e.g., EU AI Act) and rapid advancements in generative AI, new challenges emerge in the area of Human-Centered Responsible Artificial Intelligence (HCR-AI). As AI becomes more ubiquitous, questions around decision-making authority, human oversight, accountability, sustainability, and the ethical and legal responsibilities of AI and their creators become paramount. Addressing these questions requires a collaborative approach. By involving stakeholders from various disciplines in the 2nd edition of the HCR-AI Special Interest Group (SIG) at CHI 2024, we aim to discuss the implications of regulations in HCI research, develop new theories, evaluation frameworks, and methods to navigate the complex nature of AI ethics, steering AI development in a direction that is beneficial and sustainable for all of humanity.

Submitted 29 February, 2024; originally announced March 2024.

Comments: 6 pages

arXiv:2306.13723 [pdf, other]

Human-AI Coevolution

Authors: Dino Pedreschi, Luca Pappalardo, Emanuele Ferragina, Ricardo Baeza-Yates, Albert-Laszlo Barabasi, Frank Dignum, Virginia Dignum, Tina Eliassi-Rad, Fosca Giannotti, Janos Kertesz, Alistair Knott, Yannis Ioannidis, Paul Lukowicz, Andrea Passarella, Alex Sandy Pentland, John Shawe-Taylor, Alessandro Vespignani

Abstract: Human-AI coevolution, defined as a process in which humans and AI algorithms continuously influence each other, increasingly characterises our society, but is understudied in artificial intelligence and complexity science literature. Recommender systems and assistants play a prominent role in human-AI coevolution, as they permeate many facets of daily life and influence human choices on online platforms. The interaction between users and AI results in a potentially endless feedback loop, wherein users' choices generate data to train AI models, which, in turn, shape subsequent user preferences. This human-AI feedback loop has peculiar characteristics compared to traditional human-machine interaction and gives rise to complex and often "unintended" social outcomes. This paper introduces Coevolution AI as the cornerstone for a new field of study at the intersection between AI and complexity science focused on the theoretical, empirical, and mathematical investigation of the human-AI feedback loop. In doing so, we: (i) outline the pros and cons of existing methodologies and highlight shortcomings and potential ways for capturing feedback loop mechanisms; (ii) propose a reflection at the intersection between complexity science, AI and society; (iii) provide real-world examples for different human-AI ecosystems; and (iv) illustrate challenges to the creation of such a field of study, conceptualising them at increasing levels of abstraction, i.e., technical, epistemological, legal and socio-political.

Submitted 3 May, 2024; v1 submitted 23 June, 2023; originally announced June 2023.

arXiv:2306.01650 [pdf, other]

Fair multilingual vandalism detection system for Wikipedia

Authors: Mykola Trokhymovych, Muniza Aslam, Ai-Jou Chou, Ricardo Baeza-Yates, Diego Saez-Trumper

Abstract: This paper presents a novel design of the system aimed at supporting the Wikipedia community in addressing vandalism on the platform. To achieve this, we collected a massive dataset of 47 languages, and applied advanced filtering and feature engineering techniques, including multilingual masked language modeling to build the training dataset from human-generated data. The performance of the system was evaluated through comparison with the one used in production in Wikipedia, known as ORES. Our research results in a significant increase in the number of languages covered, making Wikipedia patrolling more efficient for a wider range of communities. Furthermore, our model outperforms ORES, ensuring that the results provided are not only more accurate but also less biased against certain groups of contributors.

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2303.15592 [pdf, other]

doi 10.1145/3610914

Uncovering Bias in Personal Informatics

Authors: Sofia Yfantidou, Pavlos Sermpezis, Athena Vakali, Ricardo Baeza-Yates

Abstract: Personal informatics (PI) systems, powered by smartphones and wearables, enable people to lead healthier lifestyles by providing meaningful and actionable insights that break down barriers between users and their health information. Today, such systems are used by billions of users for monitoring not only physical activity and sleep but also vital signs and women's and heart health, among others. Despite their widespread usage, the processing of sensitive PI data may suffer from biases, which may entail practical and ethical implications. In this work, we present the first comprehensive empirical and analytical study of bias in PI systems, including biases in raw data and in the entire machine learning life cycle. We use the most detailed framework to date for exploring the different sources of bias and find that biases exist both in the data generation and the model learning and implementation streams. According to our results, the most affected minority groups are users with health issues, such as diabetes, joint issues, and hypertension, and female users, whose data biases are propagated or even amplified by learning models, while intersectional biases can also be observed.

Submitted 19 July, 2023; v1 submitted 27 March, 2023; originally announced March 2023.

Report number: Volume: 7 Number: 3, Article: 139

Journal ref: IMWUT 2023

arXiv:2302.08157 [pdf, other]

doi 10.1145/3544549.3583178

Human-Centered Responsible Artificial Intelligence: Current & Future Trends

Authors: Mohammad Tahaei, Marios Constantinides, Daniele Quercia, Sean Kennedy, Michael Muller, Simone Stumpf, Q. Vera Liao, Ricardo Baeza-Yates, Lora Aroyo, Jess Holbrook, Ewa Luger, Michael Madaio, Ilana Golbin Blumenfeld, Maria De-Arteaga, Jessica Vitak, Alexandra Olteanu

Abstract: In recent years, the CHI community has seen significant growth in research on Human-Centered Responsible Artificial Intelligence. While different research communities may use different terminology to discuss similar topics, all of this work is ultimately aimed at developing AI that benefits humanity while being grounded in human rights and ethics, and reducing the potential harms of AI. In this special interest group, we aim to bring together researchers from academia and industry interested in these topics to map current and future research trends to advance this important area of research by fostering collaboration and sharing ideas.

Submitted 16 February, 2023; originally announced February 2023.

Comments: To appear in Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems

arXiv:2203.04135 [pdf, other]

Bots don't Vote, but They Surely Bother! A Study of Anomalous Accounts in a National Referendum

Authors: Eduardo Graells-Garrido, Ricardo Baeza-Yates

Abstract: The Web contains several social media platforms for discussion, exchange of ideas, and content publishing. These platforms are used by people, but also by distributed agents known as bots. Although bots have existed for decades, with many of them being benevolent, their influence in propagating and generating deceptive information has increased in recent years. Here we present a characterization of the discussion on Twitter about the 2020 Chilean constitutional referendum. The characterization uses a profile-oriented analysis that enables the isolation of anomalous content using machine learning. As a result, we obtain a characterization that matches national vote turnout, and we measure how anomalous accounts (some of which are automated bots) produce content and interact promoting (false) information.

Submitted 8 March, 2022; originally announced March 2022.

Comments: 5 pages, 9 figures

arXiv:2011.05353 [pdf, other]

Adaptive Community Search in Dynamic Networks

Authors: Ioanna Tsalouchidou, Francesco Bonchi, Ricardo Baeza-Yates

Abstract: Community search is a well-studied problem which, given a static graph and a query set of vertices, requires finding a cohesive (or dense) subgraph containing the query vertices. In this paper we study the problem of community search in temporal dynamic networks. We adapt to the temporal setting the notion of network inefficiency, which is based on the pairwise shortest-path distance among all the vertices in a solution. For this purpose we define the notion of shortest-fastest-path distance: a linear combination of the temporal and spatial dimensions governed by a user-defined parameter. We thus define the Minimum Temporal-Inefficiency Subgraph problem and show that it is NP-hard. We develop an algorithm which exploits a careful transformation of the temporal network to a static directed and weighted graph, and a recent approximation algorithm for finding the minimum Directed Steiner Tree. We finally generalize our framework to the streaming setting in which new snapshots of the temporal graph keep arriving continuously and our goal is to produce a community search solution for the temporal graph corresponding to a sliding time window.
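
The abstract describes the shortest-fastest-path distance only as a linear combination of temporal and spatial dimensions governed by a user-defined parameter. A minimal Python sketch of that idea follows. The convex form `alpha * duration + (1 - alpha) * hops`, the `TemporalPath` type, and the function names are assumptions made for illustration; the paper's exact definition may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalPath:
    """Hypothetical summary of one path through a temporal network."""
    hops: int      # spatial length: number of edges traversed
    duration: int  # temporal length: arrival time minus departure time


def combined_cost(path: TemporalPath, alpha: float) -> float:
    """Blend temporal and spatial length with a user-chosen weight.

    alpha = 0 ranks paths by hop count only (shortest);
    alpha = 1 ranks paths by duration only (fastest).
    The convex combination is an assumed form, not the paper's formula.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * path.duration + (1.0 - alpha) * path.hops


def best_path(paths: list[TemporalPath], alpha: float) -> TemporalPath:
    """Return the candidate path with the smallest combined cost."""
    return min(paths, key=lambda p: combined_cost(p, alpha))


few_hops_but_slow = TemporalPath(hops=2, duration=10)
many_hops_but_fast = TemporalPath(hops=5, duration=3)
candidates = [few_hops_but_slow, many_hops_but_fast]

print(best_path(candidates, alpha=0.0))  # prefers the path with fewer hops
print(best_path(candidates, alpha=1.0))  # prefers the path that arrives sooner
```

Sweeping `alpha` between 0 and 1 shows how a single parameter can move the preferred path from the shortest to the fastest one.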

Submitted 10 November, 2020; originally announced November 2020.

Comments: IEEE BigData 2020

arXiv:2005.10019 [pdf, other]

doi 10.1145/3394231.3397907

Every Colour You Are: Stance Prediction and Turnaround in Controversial Issues

Authors: Eduardo Graells-Garrido, Ricardo Baeza-Yates, Mounia Lalmas

Abstract: Web platforms have allowed political manifestation and debate for decades. Technology changes have brought new opportunities for expression, and the availability of longitudinal data of these debates raises new questions regarding who participates, and who updates their opinion. The aim of this work is to provide a methodology to measure these phenomena, and to test this methodology on a specific topic, abortion, as observed on one of the most popular micro-blogging platforms. To do so, we followed the discussion on Twitter about abortion in two Spanish-speaking countries from 2015 to 2018. Our main insights are twofold. On the one hand, people adopted new technologies to express their stances, particularly colored variations of heart emojis ([green heart] & [purple heart]) in a way that mirrored physical manifestations on abortion. On the other hand, even on issues with strong opinions, opinions can change, and these changes show differences in demographic groups. These findings imply that debate on the Web embraces new ways of stance adherence, and that changes of opinion can be measured and characterized.

Submitted 19 May, 2020; originally announced May 2020.

Comments: Accepted at WebSci'20

arXiv:1906.03168 [pdf, other]

doi 10.1371/journal.pone.0241687

Predicting risk of dyslexia with an online gamified test

Authors: Luz Rello, Ricardo Baeza-Yates, Abdullah Ali, Jeffrey P. Bigham, Miquel Serra

Abstract: Dyslexia is a specific learning disorder related to school failure. Detection is both crucial and challenging, especially in languages with transparent orthographies, such as Spanish. To make detecting dyslexia easier, we designed an online gamified test and a predictive machine learning model. In a study with more than 3,600 participants, our model correctly detected over 80% of the participants with dyslexia. To check the robustness of the method we tested our method using a new data set with over 1,300 participants with age customized tests in a different environment -- a tablet instead of a desktop computer -- reaching a recall of over 72% for the class with dyslexia for children 9 years old or older. Our work shows that dyslexia can be screened using a machine learning approach. An online screening tool based on our methods has already been used by more than 200,000 people.

Submitted 9 December, 2019; v1 submitted 7 June, 2019; originally announced June 2019.

arXiv:1807.07162 [pdf, other]

What kind of content are you prone to tweet? Multi-topic Preference Model for Tweeters

Authors: Lorena Recalde, Ricardo Baeza-Yates

Abstract: According to tastes, a person could show preference for a given category of content to a greater or lesser extent. However, quantifying people's amount of interest in a certain topic is a challenging task, especially considering the massive digital information they are exposed to. For example, in the context of Twitter, aligned with his/her preferences a user may tweet and retweet more about technology than sports and not share any music-related content. The problem we address in this paper is the identification of users' implicit topic preferences by analyzing the content categories they tend to post on Twitter. Our proposal is significant given that modeling their multi-topic profile may be useful to find patterns or association between preferences for categories, discover trending topics and cluster similar users to generate better group recommendations of content. In the present work, we propose a method based on the Mixed Gaussian Model to extract the multidimensional preference representation for 399 Ecuadorian tweeters concerning twenty-two different topics (or dimensions) which became known by manually categorizing 68,186 tweets. Our experimental findings indicate that the proposed approach is effective at detecting the topic interests of users.

Submitted 18 July, 2018; originally announced July 2018.

Comments: 16 pages, 4 figures, Workshop on Social Aspects in Personalization and Search (SOAPS 2018), collocated with ECIR 2018, Apr 26, Grenoble, France

arXiv:1711.02295 [pdf, other]

Quality-Efficiency Trade-offs in Machine Learning for Text Processing

Authors: Ricardo Baeza-Yates, Zeinab Liaghat

Abstract: Data mining, machine learning, and natural language processing are powerful techniques that can be used together to extract information from large texts. Depending on the task or problem at hand, there are many different approaches that can be used. The methods available are continuously being optimized, but not all these methods have been tested and compared in a set of problems that can be solved using supervised machine learning algorithms. The question is what happens to the quality of the methods if we increase the training data size from, say, 100 MB to over 1 GB? Moreover, are quality gains worth it when the rate of data processing diminishes? Can we trade quality for time efficiency and recover the quality loss by just being able to process more data? We attempt to answer these questions in a general way for text processing tasks, considering the trade-offs involving training data size, learning time, and quality obtained. We propose a performance trade-off framework and apply it to three important text processing problems: Named Entity Recognition, Sentiment Analysis and Document Classification. These problems were also chosen because they have different levels of object granularity: words, paragraphs, and documents. For each problem, we selected several supervised machine learning algorithms and we evaluated their trade-offs on large publicly available data sets (news, reviews, patents). To explore these trade-offs, we use different data subsets of increasing size ranging from 50 MB to several GB. We also consider the impact of the data set and the evaluation technique. We find that the results do not change significantly and that most of the time the best algorithm is the fastest. However, we also show that the results for small data (say less than 100 MB) are different from the results for big data and in those cases the best algorithm is much harder to determine.

Submitted 7 November, 2017; originally announced November 2017.

Comments: Ten pages, long version of paper that will be presented at IEEE Big Data 2017 (8 pages)

arXiv:1707.08810 [pdf, other]

doi 10.1145/3078714.3078735

Detection of Trending Topic Communities: Bridging Content Creators and Distributors

Authors: Lorena Recalde, David F. Nettleton, Ricardo Baeza-Yates, Ludovico Boratto

Abstract: The rise of a trending topic on Twitter or Facebook leads to the temporal emergence of a set of users currently interested in that topic. Given the temporary nature of the links between these users, being able to dynamically identify communities of users related to this trending topic would allow for a rapid spread of information. Indeed, individual users inside a community might receive recommendations of content generated by the other users, or the community as a whole could receive group recommendations, with new content related to that trending topic. In this paper, we tackle this challenge, by identifying coherent topic-dependent user groups, linking those who generate the content (creators) and those who spread this content, e.g., by retweeting/reposting it (distributors). This is a novel problem on group-to-group interactions in the context of recommender systems. Analysis on real-world Twitter data compares our proposal with a baseline approach that considers the retweeting activity, and validates it with standard metrics. Results show the effectiveness of our approach to identify communities interested in a topic where each includes content creators and content distributors, facilitating users' interactions and the spread of new information.

Submitted 27 July, 2017; originally announced July 2017.

Comments: 9 pages, 4 figures, 2 tables, Hypertext 2017 conference

arXiv:1706.06368 [pdf, other]

doi 10.1145/3132847.3132938

FA*IR: A Fair Top-k Ranking Algorithm

Authors: Meike Zehlike, Francesco Bonchi, Carlos Castillo, Sara Hajian, Mohamed Megahed, Ricardo Baeza-Yates

Abstract: In this work, we define and solve the Fair Top-k Ranking problem, in which we want to determine a subset of k candidates from a large pool of n >> k candidates, maximizing utility (i.e., select the "best" candidates) subject to group fairness criteria. Our ranked group fairness definition extends group fairness using the standard notion of protected groups and is based on ensuring that the proportion of protected candidates in every prefix of the top-k ranking remains statistically above or indistinguishable from a given minimum. Utility is operationalized in two ways: (i) every candidate included in the top-k should be more qualified than every candidate not included; and (ii) for every pair of candidates in the top-k, the more qualified candidate should be ranked above. An efficient algorithm is presented for producing the Fair Top-k Ranking, and tested experimentally on existing datasets as well as new datasets released with this paper, showing that our approach yields small distortions with respect to rankings that maximize utility without considering fairness criteria. To the best of our knowledge, this is the first algorithm grounded in statistical tests that can mitigate biases in the representation of an under-represented group along a ranked list.
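
The prefix condition in this abstract (the share of protected candidates in every prefix stays statistically above or indistinguishable from a given minimum) can be illustrated with a one-sided binomial test per prefix. The Python sketch below is one plausible reading, not the paper's algorithm: the function names, the parameters `p` (target minimum proportion) and `alpha` (significance level), and the choice of a plain binomial CDF are assumptions, and no correction for testing many prefixes at once is applied.

```python
from math import comb


def binom_cdf(x: int, n: int, p: float) -> float:
    """P[X <= x] for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(x + 1))


def min_protected(prefix_len: int, p: float, alpha: float) -> int:
    """Smallest count m of protected candidates in a prefix such that
    P[Binomial(prefix_len, p) <= m] > alpha, i.e. the count is not
    significantly below the target proportion p (assumed test)."""
    for m in range(prefix_len + 1):
        if binom_cdf(m, prefix_len, p) > alpha:
            return m
    return prefix_len


def satisfies_prefix_fairness(is_protected: list[bool], p: float, alpha: float) -> bool:
    """Check every prefix of a ranking against the per-prefix minimum."""
    seen = 0
    for prefix_len, flag in enumerate(is_protected, start=1):
        seen += int(flag)
        if seen < min_protected(prefix_len, p, alpha):
            return False
    return True


# Ranking flags: True marks a candidate from the protected group.
print(satisfies_prefix_fairness([False, False, False, True], p=0.5, alpha=0.1))   # True
print(satisfies_prefix_fairness([False, False, False, False], p=0.5, alpha=0.1))  # False
```

With `p = 0.5` and `alpha = 0.1`, the first three positions may all be non-protected, but the fourth prefix already requires at least one protected candidate, which is why the second ranking fails.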

Submitted 2 July, 2018; v1 submitted 20 June, 2017; originally announced June 2017.

Comments: In Proceedings of the 26th ACM International Conference on Information and Knowledge Management (CIKM'17). This version corrects an error on Table 4

ACM Class: H.3.3; J.1

arXiv:1607.01869 [pdf, other]

doi 10.1145/2911451.2911538

Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising

Authors: Mihajlo Grbovic, Nemanja Djuric, Vladan Radosavljevic, Fabrizio Silvestri, Ricardo Baeza-Yates, Andrew Feng, Erik Ordentlich, Lee Yang, Gavin Owens

Abstract: Sponsored search represents a major source of revenue for web search engines. This popular advertising model brings a unique possibility for advertisers to target users' immediate intent communicated through a search query, usually by displaying their ads alongside organic search results for queries deemed relevant to their products or services. However, due to a large number of unique queries it is challenging for advertisers to identify all such relevant queries. For this reason search engines often provide a service of advanced matching, which automatically finds additional relevant queries for advertisers to bid on. We present a novel advanced matching approach based on the idea of semantic embeddings of queries and ads. The embeddings were learned using a large data set of user search sessions, consisting of search queries, clicked ads and search links, while utilizing contextual information such as dwell time and skipped ads. To address the large-scale nature of our problem, both in terms of data and vocabulary size, we propose a novel distributed algorithm for training of the embeddings. Finally, we present an approach for overcoming a cold-start problem associated with new ads and queries. We report results of editorial evaluation and online tests on actual search traffic. The results show that our approach significantly outperforms baselines in terms of relevance, coverage, and incremental revenue. Lastly, we open-source learned query embeddings to be used by researchers in computational advertising and related fields.

Submitted 6 July, 2016; originally announced July 2016.

Comments: 10 pages, 4 figures, 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy

Journal ref: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy

arXiv:1604.06481 [pdf, other]

Visual Congruent Ads for Image Search

Authors: Yannis Kalantidis, Ayman Farahat, Lyndon Kennedy, Ricardo Baeza-Yates, David A. Shamma

Abstract: The quality of user experience online is affected by the relevance and placement of advertisements. We propose a new system for selecting and displaying visual advertisements in image search result sets. Our method compares the visual similarity of candidate ads to the image search results and selects the most visually similar ad to be displayed. The method further selects an appropriate location in the displayed image grid to minimize the perceptual visual differences between the ad and its neighbors. We conduct an experiment with about 900 users and find that our proposed method provides significant improvement in the users' overall satisfaction with the image search experience, without diminishing the users' ability to see the ad or recall the advertised brand.

Submitted 21 April, 2016; originally announced April 2016.

arXiv:1604.03044 [pdf, other]

doi 10.1145/2700171.2791056

Wisdom of the Crowd or Wisdom of a Few? An Analysis of Users' Content Generation

Authors: Ricardo Baeza-Yates, Diego Saez-Trumper

Abstract: In this paper we analyze how user generated content (UGC) is created, challenging the well known wisdom of crowds concept. Although it is known that user activity in most settings follow a power law, that is, few people do a lot, while most do nothing, there are few studies that characterize well this activity. In our analysis of datasets from two different social networks, Facebook and Twitter, we find that a small percentage of active users and much less of all users represent 50% of the UGC. We also analyze the dynamic behavior of the generation of this content to find that the set of most active users is quite stable in time. Moreover, we study the social graph, finding that those active users are highly connected among them. This implies that most of the wisdom comes from a few users, challenging the independence assumption needed to have a wisdom of crowds. We also address the content that is never seen by any people, which we call digital desert, that challenges the assumption that the content of every person should be taken in account in a collective decision. We also compare our results with Wikipedia data and we address the quality of UGC content using an Amazon dataset. At the end our results are not surprising, as the Web is a reflection of our own society, where economical or political power also is in the hands of minorities.

Submitted 11 April, 2016; originally announced April 2016.

ACM Class: H.2.8; J.4

Journal ref: Proceedings of the 26th ACM Conference on Hypertext & Social Media, 2015

arXiv:1601.02071 [pdf, other]

Sentiment Visualisation Widgets for Exploratory Search

Authors: Eduardo Graells-Garrido, Mounia Lalmas, Ricardo Baeza-Yates

Abstract: This paper proposes the usage of visualisation widgets for exploratory search with sentiment as a facet. Starting from specific design goals for depiction of ambivalence in sentiment, two visualization widgets were implemented: scatter plot and parallel coordinates. Those widgets were evaluated against a text baseline in a small-scale usability study with exploratory tasks using Wikipedia as dataset. The study results indicate that users spend more time browsing with scatter plots in a positive way. A post-hoc analysis of individual differences in behavior revealed that when considering two types of users, explorers and achievers, engagement with scatter plots is positive and significantly greater when users are explorers. We discuss the implications of these findings for sentiment-based exploratory search and personalised user interfaces.

Submitted 8 January, 2016; originally announced January 2016.

Comments: Presented at the Social Personalization Workshop held jointly with ACM Hypertext 2014. 6 pages

ACM Class: H.3.3; H.5.2

arXiv:1601.00481 [pdf, other]

doi 10.1145/2856767.2856776

Data Portraits and Intermediary Topics: Encouraging Exploration of Politically Diverse Profiles

Authors: Eduardo Graells-Garrido, Mounia Lalmas, Ricardo Baeza-Yates

Abstract: In micro-blogging platforms, people connect and interact with others. However, due to cognitive biases, they tend to interact with like-minded people and read agreeable information only. Many efforts to make people connect with those who think differently have not worked well. In this paper, we hypothesize, first, that previous approaches have not worked because they have been direct -- they have tried to explicitly connect people with those having opposing views on sensitive issues. Second, that neither recommendation or presentation of information by themselves are enough to encourage behavioral change. We propose a platform that mixes a recommender algorithm and a visualization-based user interface to explore recommendations. It recommends politically diverse profiles in terms of distance of latent topics, and displays those recommendations in a visual representation of each user's personal content. We performed an "in the wild" evaluation of this platform, and found that people explored more recommendations when using a biased algorithm instead of ours. In line with our hypothesis, we also found that the mixture of our recommender algorithm and our user interface, allowed politically interested users to exhibit an unbiased exploration of the recommended profiles. Finally, our results contribute insights in two aspects: first, which individual differences are important when designing platforms aimed at behavioral change; and second, which algorithms and user interfaces should be mixed to help users avoid cognitive mechanisms that lead to biased behavior.

Submitted 4 January, 2016; originally announced January 2016.

Comments: 12 pages, 7 figures. To be presented at ACM Intelligent User Interfaces 2016

ACM Class: H.4.3; H.5.2

arXiv:1510.01920 [pdf, other]

doi 10.1145/2856767.2856775

Encouraging Diversity- and Representation-Awareness in Geographically Centralized Content

Authors: Eduardo Graells-Garrido, Mounia Lalmas, Ricardo Baeza-Yates

Abstract: In centralized countries, not only population, media and economic power are concentrated, but people give more attention to central locations. While this is not inherently bad, this behavior extends to micro-blogging platforms: central locations get more attention in terms of information flow. In this paper we study the effects of an information filtering algorithm that decentralizes content in such platforms. Particularly, we find that users from non-central locations were not able to identify the geographical diversity on timelines generated by the algorithm, which were diverse by construction. To make users see the inherent diversity, we define a design rationale to approach this problem, focused on an already known visualization technique: treemaps. Using interaction data from an "in the wild" deployment of our proposed system, we find that, even though there are effects of centralization in exploratory user behavior, the treemap was able to make users see the inherent geographical diversity of timelines, and engage with user generated content. With these results in mind, we propose practical actions for micro-blogging platforms to account for the differences and biased behavior induced by centralization.

Submitted 7 October, 2015; originally announced October 2015.

Comments: 12 pages. Under review. Please contact authors before citing / distributing

ACM Class: H.3.3; H.5.2

arXiv:1506.00963 [pdf, other]

Finding Intermediary Topics Between People of Opposing Views: A Case Study

Authors: Eduardo Graells-Garrido, Mounia Lalmas, Ricardo Baeza-Yates

Abstract: In micro-blogging platforms, people can connect with others and have conversations on a wide variety of topics. However, because of homophily and selective exposure, users tend to connect with like-minded people and only read agreeable information. Motivated by this scenario, in this paper we study the diversity of intermediary topics, which are latent topics estimated from user generated content. These topics can be used as features in recommender systems aimed at introducing people of diverse political viewpoints. We conducted a case study on Twitter, considering the debate about a sensitive issue in Chile, where we quantified homophilic behavior in terms of political discussion and then we evaluated the diversity of intermediary topics in terms of political stances of users.

Submitted 30 July, 2015; v1 submitted 2 June, 2015; originally announced June 2015.

Comments: 6 pages. Presented at the International Workshop on Social Personalisation & Search, SPS2015 (co-located with SIGIR 2015)

ACM Class: H.3.4

arXiv:1309.2679 [pdf]

Caracterizando la Web Chilena

Authors: Eduardo Graells-Garrido, Ricardo Baeza-Yates

Abstract: This article presents a characterization of the web space from Chile in 2007. The characterization shows distributions of sites and domains, analysis of document content and server configuration. In addition, the network structure of the Chilean Web is analyzed, determining components based on hyperlink structure at the document and site levels. Original Abstract: En este artículo se muestra una caracterización del espacio web de Chile para el año 2007. Se muestran distribuciones de sitios y dominios, caracterización del contenido en base a tipos de documento, así como configuración de los servidores. Se estudia la estructura de la red creada mediante hipervínculos en los documentos y cómo las diferentes componentes de esta estructura varían cuando los hipervínculos son agregados a nivel de sitios.

Submitted 10 September, 2013; originally announced September 2013.

Comments: In Spanish. Published in "Revista Bits de Ciencia" vol. 2, 2009. Department of Computer Science, University of Chile. Available in http://www.dcc.uchile.cl/revista

arXiv:1309.1890 [pdf, other]

doi 10.1109/LA-WEB.2008.19

Evolution of the Chilean Web: A Larger Study

Authors: Eduardo Graells-Garrido, Ricardo Baeza-Yates

Abstract: In this paper we extend our previous and only study on the dynamics of the Chilean Web. This new study doubles the time period and to the best of our knowledge is the only study of its type known about any country in the Web. The new results corroborate the trends found before, in particular the exponential growth of the Web, and reinforce the conclusion that the Web is more chaotic than we would like. Hence, modeling most Web characteristics is not trivial.

Submitted 7 September, 2013; originally announced September 2013.

Comments: Presented at the Sixth Latin American Web Congress, 2008, Vila Velha, Espírito Santo, Brazil

arXiv:1204.2712 [pdf, ps, other]

Learning to Rank Query Recommendations by Semantic Similarities

Authors: Sumio Fujita, Georges Dupret, Ricardo Baeza-Yates

Abstract: Logs of the interactions with a search engine show that users often reformulate their queries. Examining these reformulations shows that recommendations that precise the focus of a query are helpful, like those based on expansions of the original queries. But it also shows that queries that express some topical shift with respect to the original query can help user access more rapidly the information they need. We propose a method to identify from the query logs of past users queries that either focus or shift the initial query topic. This method combines various click-based, topic-based and session based ranking strategies and uses supervised learning in order to maximize the semantic similarities between the query and the recommendations, while at the same diversifying them. We evaluate our method using the query/click logs of a Japanese web search engine and we show that the combination of the three methods proposed is significantly better than any of them taken individually.

Submitted 12 April, 2012; originally announced April 2012.

Comments: 2nd International Workshop on Usage Analysis and the Web of Data (USEWOD2012) in the 21st International World Wide Web Conference (WWW2012), Lyon, France, April 17th, 2012

Report number: WWW2012USEWOD/2012/fuduba ACM Class: H.3.3; H.3.5

arXiv:1006.5059 [pdf, ps, other]

Capacity Planning for Vertical Search Engines

Authors: Claudine Badue, Jussara Almeida, Virgilio Almeida, Ricardo Baeza-Yates, Berthier Ribeiro-Neto, Artur Ziviani, Nivio Ziviani

Abstract: Vertical search engines focus on specific slices of content, such as the Web of a single country or the document collection of a large corporation. Despite this, like general open web search engines, they are expensive to maintain, expensive to operate, and hard to design. Because of this, predicting the response time of a vertical search engine is usually done empirically through experimentation, requiring a costly setup. An alternative is to develop a model of the search engine for predicting performance. However, this alternative is of interest only if its predictions are accurate. In this paper we propose a methodology for analyzing the performance of vertical search engines. Applying the proposed methodology, we present a capacity planning model based on a queueing network for search engines with a scale typically suitable for the needs of large corporations. The model is simple and yet reasonably accurate and, in contrast to previous work, considers the imbalance in query service times among homogeneous index servers. We discuss how we tune up the model and how we apply it to predict the impact on the query response time when parameters such as CPU and disk capacities are changed. This allows a manager of a vertical search engine to determine a priori whether a new configuration of the system might keep the query response under specified performance constraints.

Submitted 25 June, 2010; originally announced June 2010.

Showing 1–26 of 26 results for author: Baeza-Yates, R