Showing 1–25 of 25 results for author: Thirumuruganathan, S

Searching in archive cs.

  1. arXiv:2304.12573  [pdf, other]

    cs.LG cs.CY cs.DB

    Fairness and Bias in Truth Discovery Algorithms: An Experimental Analysis

    Authors: Simone Lazier, Saravanan Thirumuruganathan, Hadis Anahideh

    Abstract: Machine learning (ML) based approaches are increasingly being used in a number of applications with societal impact. Training ML models often require vast amounts of labeled data, and crowdsourcing is a dominant paradigm for obtaining labels from multiple workers. Crowd workers may sometimes provide unreliable labels, and to address this, truth discovery (TD) algorithms such as majority voting are…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: Accepted in Algorithmic Fairness in Artificial intelligence, Machine learning and Decision Making workshop at SDM 2023

  2. arXiv:2006.13025   

    cs.LG stat.ML

    Fair Active Learning

    Authors: Hadis Anahideh, Abolfazl Asudeh, Saravanan Thirumuruganathan

    Abstract: Machine learning (ML) is increasingly being used in high-stakes applications impacting society. Therefore, it is of critical importance that ML models do not propagate discrimination. Collecting accurate labeled data in societal applications is challenging and costly. Active learning is a promising approach to build an accurate classifier by interactively querying an oracle within a labeling budge…

    Submitted 1 July, 2020; v1 submitted 20 June, 2020; originally announced June 2020.

    Comments: This was intended as a replacement of arXiv:2001.01796; please see the updated version there.

  3. arXiv:2001.01796  [pdf, other]

    cs.LG stat.ML

    Fair Active Learning

    Authors: Hadis Anahideh, Abolfazl Asudeh, Saravanan Thirumuruganathan

    Abstract: Machine learning (ML) is increasingly being used in high-stakes applications impacting society. Therefore, it is of critical importance that ML models do not propagate discrimination. Collecting accurate labeled data in societal applications is challenging and costly. Active learning is a promising approach to build an accurate classifier by interactively querying an oracle within a labeling budge…

    Submitted 31 March, 2021; v1 submitted 6 January, 2020; originally announced January 2020.

  4. arXiv:1909.01120  [pdf, other]

    cs.DB cs.CL cs.LG

    Local Embeddings for Relational Data Integration

    Authors: Riccardo Cappuzzo, Paolo Papotti, Saravanan Thirumuruganathan

    Abstract: Deep learning based techniques have been recently used with promising results for data integration problems. Some methods directly use pre-trained embeddings that were trained on a large corpus such as Wikipedia. However, they may not always be an appropriate choice for enterprise datasets with custom vocabulary. Other methods adapt techniques from natural language processing to obtain embeddings…

    Submitted 3 September, 2020; v1 submitted 3 September, 2019; originally announced September 2019.

    Comments: Accepted to SIGMOD 2020 as Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. Code can be found at https://gitlab.eurecom.fr/cappuzzo/embdi

  5. ZeroER: Entity Resolution using Zero Labeled Examples

    Authors: Renzhi Wu, Sanya Chaba, Saurabh Sawlani, Xu Chu, Saravanan Thirumuruganathan

    Abstract: Entity resolution (ER) refers to the problem of matching records in one or more relations that refer to the same real-world entity. While supervised machine learning (ML) approaches achieve the state-of-the-art results, they require a large amount of labeled examples that are expensive to obtain and often times infeasible. We investigate an important problem that vexes practitioners: is it possibl…

    Submitted 6 April, 2020; v1 submitted 16 August, 2019; originally announced August 2019.

    Comments: Published at 2020 ACM SIGMOD International Conference on Management of Data

  6. arXiv:1907.13276  [pdf, other]

    cs.LG stat.ML

    Are Outlier Detection Methods Resilient to Sampling?

    Authors: Laure Berti-Equille, Ji Meng Loh, Saravanan Thirumuruganathan

    Abstract: Outlier detection is a fundamental task in data mining and has many applications including detecting errors in databases. While there has been extensive prior work on methods for outlier detection, modern datasets often have sizes that are beyond the ability of commonly used methods to process the data within a reasonable time. To overcome this issue, outlier detection methods can be trained over…

    Submitted 30 July, 2019; originally announced July 2019.

    Comments: 18 pages

  7. arXiv:1903.10000  [pdf, other]

    cs.DB cs.LG

    Approximate Query Processing using Deep Generative Models

    Authors: Saravanan Thirumuruganathan, Shohedul Hasan, Nick Koudas, Gautam Das

    Abstract: Data is generated at an unprecedented rate surpassing our ability to analyze them. The database community has pioneered many novel techniques for Approximate Query Processing (AQP) that could give approximate results in a fraction of time needed for computing exact results. In this work, we explore the usage of deep learning (DL) for answering aggregate queries specifically for interactive applica…

    Submitted 18 November, 2019; v1 submitted 24 March, 2019; originally announced March 2019.

    Comments: Accepted to ICDE 2020 as "Approximate Query Processing for Data Exploration using Deep Generative Models"

  8. arXiv:1903.09999  [pdf, other]

    cs.DB cs.LG

    Multi-Attribute Selectivity Estimation Using Deep Learning

    Authors: Shohedul Hasan, Saravanan Thirumuruganathan, Jees Augustine, Nick Koudas, Gautam Das

    Abstract: Selectivity estimation - the problem of estimating the result size of queries - is a fundamental problem in databases. Accurate estimation of query selectivity involving multiple correlated attributes is especially challenging. Poor cardinality estimates could result in the selection of bad plans by the query optimizer. We investigate the feasibility of using deep learning based approaches for bot…

    Submitted 17 June, 2019; v1 submitted 24 March, 2019; originally announced March 2019.

  9. arXiv:1809.11084  [pdf, other]

    cs.DB cs.LG stat.ML

    Reuse and Adaptation for Entity Resolution through Transfer Learning

    Authors: Saravanan Thirumuruganathan, Shameem A Puthiya Parambath, Mourad Ouzzani, Nan Tang, Shafiq Joty

    Abstract: Entity resolution (ER) is one of the fundamental problems in data integration, where machine learning (ML) based classifiers often provide the state-of-the-art results. Considerable human effort goes into feature engineering and training data creation. In this paper, we investigate a new problem: Given a dataset D_T for ER with limited or no training data, is it possible to train a good ML classif…

    Submitted 28 September, 2018; originally announced September 2018.

  10. arXiv:1803.01384  [pdf, other]

    cs.DB

    Data Curation with Deep Learning [Vision]

    Authors: Saravanan Thirumuruganathan, Nan Tang, Mourad Ouzzani, AnHai Doan

    Abstract: Data curation - the process of discovering, integrating, and cleaning data - is one of the oldest, hardest, yet inevitable data management problems. Despite decades of efforts from both researchers and practitioners, it is still one of the most time consuming and least enjoyable work of data scientists. In most organizations, data curation plays an important role so as to fully unlock the value of…

    Submitted 24 March, 2019; v1 submitted 4 March, 2018; originally announced March 2018.

  11. arXiv:1802.02351  [pdf, other]

    cs.OH

    Road Network Fusion for Incremental Map Updates

    Authors: Rade Stanojevic, Sofiane Abbar, Saravanan Thirumuruganathan, Gianmarco De Francisci Morales, Sanjay Chawla, Fethi Filali, Ahid Aleimat

    Abstract: In the recent years a number of novel, automatic map-inference techniques have been proposed, which derive road-network from a cohort of GPS traces collected by a fleet of vehicles. In spite of considerable attention, these maps are imperfect in many ways: they create an abundance of spurious connections, have poor coverage, and are visually confusing. Hence, commercial and crowd-sourced mapping s…

    Submitted 7 February, 2018; originally announced February 2018.

    Journal ref: In the special volume of Springer's Lecture Notes in Cartography and Geoinformation (LBS 2018).

  12. DeepER -- Deep Entity Resolution

    Authors: Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, Nan Tang

    Abstract: Entity resolution (ER) is a key data integration problem. Despite the efforts in 70+ years in all aspects of ER, there is still a high demand for democratizing ER - humans are heavily involved in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representation of words (a.k.a. word…

    Submitted 18 November, 2019; v1 submitted 2 October, 2017; originally announced October 2017.

    Comments: Accepted to PVLDB 2018 as "Distributed Representations of Tuples for Entity Resolution". This version corrects a minor issue in Example 4 pointed out by Andrew Borthwick and Matthias Boehm

  13. Malware in the Future? Forecasting of Analyst Detection of Cyber Events

    Authors: Jonathan Z. Bakdash, Steve Hutchinson, Erin G. Zaroukian, Laura R. Marusich, Saravanan Thirumuruganathan, Charmaine Sample, Blaine Hoffman, Gautam Das

    Abstract: There have been extensive efforts in government, academia, and industry to anticipate, forecast, and mitigate cyber attacks. A common approach is time-series forecasting of cyber attacks based on data from network telescopes, honeypots, and automated intrusion detection/prevention systems. This research has uncovered key insights such as systematicity in cyber attacks. Here, we propose an alternat…

    Submitted 8 June, 2018; v1 submitted 11 July, 2017; originally announced July 2017.

    Comments: Revised version resubmitted to journal

  14. A Cost-based Optimizer for Gradient Descent Optimization

    Authors: Zoi Kaoudi, Jorge-Arnulfo Quiané-Ruiz, Saravanan Thirumuruganathan, Sanjay Chawla, Divy Agrawal

    Abstract: As the use of machine learning (ML) permeates into diverse application domains, there is an urgent need to support a declarative framework for ML. Ideally, a user will specify an ML task in a high-level and easy-to-use language and the framework will invoke the appropriate algorithms and system configurations to execute it. An important observation towards designing such a framework is that many M…

    Submitted 27 March, 2017; originally announced March 2017.

    Comments: Accepted at SIGMOD 2017

  15. arXiv:1702.06025  [pdf, other]

    cs.OH

    Kharita: Robust Map Inference using Graph Spanners

    Authors: Rade Stanojevic, Sofiane Abbar, Saravanan Thirumuruganathan, Sanjay Chawla, Fethi Filali, Ahid Aleimat

    Abstract: The widespread availability of GPS information in everyday devices such as cars, smartphones and smart watches make it possible to collect large amount of geospatial trajectory information. A particularly important, yet technically challenging, application of this data is to identify the underlying road network and keep it updated under various changes. In this paper, we propose efficient algorith…

    Submitted 20 February, 2017; originally announced February 2017.

  16. arXiv:1602.03730  [pdf, other]

    cs.DB

    HDBSCAN: Density based Clustering over Location Based Services

    Authors: Md Farhadur Rahman, Weimo Liu, Saad Bin Suhaim, Saravanan Thirumuruganathan, Nan Zhang, Gautam Das

    Abstract: Location Based Services (LBS) have become extremely popular and used by millions of users. Popular LBS run the entire gamut from mapping services (such as Google Maps) to restaurants (such as Yelp) and real-estate (such as Redfin). The public query interfaces of LBS can be abstractly modeled as a kNN interface over a database of two dimensional points: given an arbitrary query point, the system re…

    Submitted 16 February, 2016; v1 submitted 11 February, 2016; originally announced February 2016.

  17. Discovering the Skyline of Web Databases

    Authors: Abolfazl Asudeh, Saravanan Thirumuruganathan, Nan Zhang, Gautam Das

    Abstract: Many web databases are "hidden" behind proprietary search interfaces that enforce the top-$k$ output constraint, i.e., each query returns at most $k$ of all matching tuples, preferentially selected and returned according to a proprietary ranking function. In this paper, we initiate research into the novel problem of skyline discovery over top-$k$ hidden web databases. Since skyline tuples provide…

    Submitted 20 March, 2016; v1 submitted 7 December, 2015; originally announced December 2015.

    Journal ref: The Proceedings of the VLDB Endowment (PVLDB) 2016, Vol 9, No. 7, pages 600 - 611

  18. arXiv:1505.02441  [pdf, other]

    cs.DB

    Aggregate Estimations over Location Based Services

    Authors: Weimo Liu, Md Farhadur Rahman, Saravanan Thirumuruganathan, Nan Zhang, Gautam Das

    Abstract: Location based services (LBS) have become very popular in recent years. They range from map services (e.g., Google Maps) that store geographic locations of points of interests, to online social networks (e.g., WeChat, Sina Weibo, FourSquare) that leverage user geographic locations to enable various recommendation functions. The public query interfaces of these services may be abstractly modeled as…

    Submitted 13 May, 2015; v1 submitted 10 May, 2015; originally announced May 2015.

  19. arXiv:1502.05106  [pdf, other]

    cs.DB

    "The Whole Is Greater Than the Sum of Its Parts": Optimization in Collaborative Crowdsourcing

    Authors: Habibur Rahman, Senjuti Basu Roy, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das

    Abstract: In this work, we initiate the investigation of optimization opportunities in collaborative crowdsourcing. Many popular applications, such as collaborative document editing, sentence translation, or citizen science resort to this special form of human-based computing, where, crowd workers with appropriate skills and expertise are required to form groups to solve complex tasks. Central to any collab…

    Submitted 12 April, 2015; v1 submitted 17 February, 2015; originally announced February 2015.

  20. arXiv:1411.1455  [pdf, other]

    cs.DB

    Rank-Based Inference over Web Databases

    Authors: Md Farhadur Rahman, Weimo Liu, Saravanan Thirumuruganathan, Nan Zhang, Gautam Das

    Abstract: In recent years, there has been much research in Ranked Retrieval model in structured databases, especially those in web databases. With this model, a search query returns top-k tuples according to not just exact matches of selection conditions, but a suitable ranking function. This paper studies a novel problem on the privacy implications of database ranking. The motivation is a novel yet serious…

    Submitted 5 April, 2015; v1 submitted 5 November, 2014; originally announced November 2014.

  21. arXiv:1410.7833  [pdf, other]

    cs.SI physics.soc-ph

    Walk, Not Wait: Faster Sampling Over Online Social Networks

    Authors: Azade Nazi, Zhuojie Zhou, Saravanan Thirumuruganathan, Nan Zhang, Gautam Das

    Abstract: In this paper, we introduce a novel, general purpose, technique for faster sampling of nodes over an online social network. Specifically, unlike traditional random walk which wait for the convergence of sampling distribution to a predetermined target distribution - a waiting process that incurs a high query cost - we develop WALK-ESTIMATE, which starts with a much shorter random walk, and then pro…

    Submitted 1 November, 2014; v1 submitted 28 October, 2014; originally announced October 2014.

  22. arXiv:1403.2763  [pdf, other]

    cs.DB

    Aggregate Estimation Over Dynamic Hidden Web Databases

    Authors: Weimo Liu, Saravanan Thirumuruganathan, Nan Zhang, Gautam Das

    Abstract: Many databases on the web are "hidden" behind (i.e., accessible only through) their restrictive, form-like, search interfaces. Recent studies have shown that it is possible to estimate aggregate query answers over such hidden web databases by issuing a small number of carefully designed search queries through the restrictive web interface. A problem with these existing work, however, is that they…

    Submitted 1 May, 2014; v1 submitted 11 March, 2014; originally announced March 2014.

  23. arXiv:1401.1302  [pdf, other]

    cs.DB cs.SI

    Optimization in Knowledge-Intensive Crowdsourcing

    Authors: Senjuti Basu Roy, Ioanna Lykourentzou, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das

    Abstract: We present SmartCrowd, a framework for optimizing collaborative knowledge-intensive crowdsourcing. SmartCrowd distinguishes itself by accounting for human factors in the process of assigning tasks to workers. Human factors designate workers' expertise in different skills, their expected minimum wage, and their availability. In SmartCrowd, we formulate task assignment as an optimization problem, an…

    Submitted 7 January, 2014; originally announced January 2014.

    Comments: 12 pages

  24. arXiv:1208.3876  [pdf, ps, other]

    cs.DB

    Digging Deeper into Deep Web Databases by Breaking Through the Top-k Barrier

    Authors: Saravanan Thirumuruganathan, Nan Zhang, Gautam Das

    Abstract: A large number of web databases are only accessible through proprietary form-like interfaces which require users to query the system by entering desired values for a few attributes. A key restriction enforced by such an interface is the top-k output constraint - i.e., when there are a large number of matching tuples, only a few (top-k) of them are preferentially selected and returned by the websit…

    Submitted 19 August, 2012; originally announced August 2012.

    Comments: 12 pages, 11 figures

  25. arXiv:1208.0285  [pdf, other]

    cs.DB

    Who Tags What? An Analysis Framework

    Authors: Mahashweta Das, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das, Cong Yu

    Abstract: The rise of Web 2.0 is signaled by sites such as Flickr, del.icio.us, and YouTube, and social tagging is essential to their success. A typical tagging action involves three components, user, item (e.g., photos in Flickr), and tags (i.e., words or phrases). Analyzing how tags are assigned by certain users to certain items has important implications in helping users search for desired information. I…

    Submitted 1 August, 2012; originally announced August 2012.

    Comments: VLDB2012

    Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 5, No. 11, pp. 1567-1578 (2012)