Showing 1–41 of 41 results for author: Kejriwal, M

  1. arXiv:2407.01892  [pdf, other]

    cs.AI cs.CL

    GRASP: A Grid-Based Benchmark for Evaluating Commonsense Spatial Reasoning

    Authors: Zhisheng Tang, Mayank Kejriwal

    Abstract: Spatial reasoning, an important faculty of human cognition with many practical applications, is one of the core commonsense skills that is not purely language-based and, for satisfying (as opposed to optimal) solutions, requires some minimum degree of planning. Existing benchmarks of Commonsense Spatial Reasoning (CSR) tend to evaluate how Large Language Models (LLMs) interpret text-based spatial…

    Submitted 1 July, 2024; originally announced July 2024.

  2. arXiv:2406.12216  [pdf, other]

    cs.CL cs.AI

    Is persona enough for personality? Using ChatGPT to reconstruct an agent's latent personality from simple descriptions

    Authors: Yongyi Ji, Zhisheng Tang, Mayank Kejriwal

    Abstract: Personality, a fundamental aspect of human cognition, contains a range of traits that influence behaviors, thoughts, and emotions. This paper explores the capabilities of large language models (LLMs) in reconstructing these complex cognitive attributes based only on simple descriptions containing socio-demographic and personality type information. Utilizing the HEXACO personality framework, our st…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Accepted to the ICML 2024 Workshop on Large Language Models and Cognition

  3. arXiv:2405.15185  [pdf, other]

    cs.CL cs.AI cs.HC

    An Evaluation of Estimative Uncertainty in Large Language Models

    Authors: Zhisheng Tang, Ke Shen, Mayank Kejriwal

    Abstract: Words of estimative probability (WEPs), such as "maybe" or "probably not", are ubiquitous in natural language for communicating estimative uncertainty, compared with direct statements involving numerical probability. Human estimative uncertainty, and its calibration with numerical estimates, has long been an area of study -- including by intelligence agencies like the CIA. This study compares e…

    Submitted 23 May, 2024; originally announced May 2024.

  4. arXiv:2312.13487  [pdf, other]

    cs.AI

    Understanding and Estimating Domain Complexity Across Domains

    Authors: Katarina Doctor, Mayank Kejriwal, Lawrence Holder, Eric Kildebeck, Emma Resmini, Christopher Pereyda, Robert J. Steininger, Daniel V. Olivença

    Abstract: Artificial Intelligence (AI) systems, trained in controlled environments, often struggle in real-world complexities. We propose a general framework for estimating domain complexity across diverse environments, like open-world learning and real-world applications. This framework distinguishes between intrinsic complexity (inherent to the domain) and extrinsic complexity (dependent on the AI agent).…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: 34 pages, 13 figures, 7 tables. arXiv admin note: substantial text overlap with arXiv:2303.04141

  5. arXiv:2312.05209  [pdf, other]

    cs.AI cs.CL

    HALO: An Ontology for Representing and Categorizing Hallucinations in Large Language Models

    Authors: Navapat Nananukul, Mayank Kejriwal

    Abstract: Recent progress in generative AI, including large language models (LLMs) like ChatGPT, has opened up significant opportunities in fields ranging from natural language processing to knowledge discovery and data mining. However, there is also a growing awareness that the models can be prone to problems such as making information up or 'hallucinations', and faulty reasoning on seemingly simple proble…

    Submitted 2 April, 2024; v1 submitted 8 December, 2023; originally announced December 2023.

    Comments: This paper has been accepted and orally presented in "SPIE Defense + Commercial Sensing (DCS 2024)" in National Harbor, Maryland, April 2024

  6. arXiv:2310.06174  [pdf, other]

    cs.AI cs.SE

    Cost-Efficient Prompt Engineering for Unsupervised Entity Resolution

    Authors: Navapat Nananukul, Khanin Sisaengsuwanchai, Mayank Kejriwal

    Abstract: Entity Resolution (ER) is the problem of semi-automatically determining when two entities refer to the same underlying entity, with applications ranging from healthcare to e-commerce. Traditional ER solutions required considerable manual expertise, including domain-specific feature engineering, as well as identification and curation of training data. Recently released large language models (LLMs)…

    Submitted 6 April, 2024; v1 submitted 9 October, 2023; originally announced October 2023.

  7. arXiv:2310.05258  [pdf, other]

    cs.AI cs.DB cs.IR

    A Knowledge Graph-Based Search Engine for Robustly Finding Doctors and Locations in the Healthcare Domain

    Authors: Mayank Kejriwal, Hamid Haidarian, Min-Hsueh Chiu, Andy Xiang, Deep Shrestha, Faizan Javed

    Abstract: Efficiently finding doctors and locations is an important search problem for patients in the healthcare domain, for which traditional information retrieval methods tend not to work optimally. In the last ten years, knowledge graphs (KGs) have emerged as a powerful way to combine the benefits of gleaning insights from semi-structured data using semantic modeling, natural language processing techniq…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Presented as an applied data science poster in KDD 2023

  8. arXiv:2310.03283  [pdf, other]

    cs.CL

    A Formalism and Approach for Improving Robustness of Large Language Models Using Risk-Adjusted Confidence Scores

    Authors: Ke Shen, Mayank Kejriwal

    Abstract: Large Language Models (LLMs), such as ChatGPT, have achieved impressive milestones in natural language processing (NLP). Despite their impressive performance, the models are known to pose important risks. As these models are deployed in real-world applications, a systematic understanding of different risks posed by these models on tasks such as natural language inference (NLI), is much needed. In…

    Submitted 4 October, 2023; originally announced October 2023.

  9. arXiv:2307.12173  [pdf, other]

    cs.AI cs.DB

    Named Entity Resolution in Personal Knowledge Graphs

    Authors: Mayank Kejriwal

    Abstract: Entity Resolution (ER) is the problem of determining when two entities refer to the same underlying entity. The problem has been studied for over 50 years, and most recently, has taken on new importance in an era of large, heterogeneous 'knowledge graphs' published on the Web and used widely in domains as wide ranging as social media, e-commerce and search. This chapter will discuss the specific p…

    Submitted 22 July, 2023; originally announced July 2023.

    Comments: To appear as a book chapter by the same name in an upcoming (Oct. 2023) book 'Personal Knowledge Graphs (PKGs): Methodology, tools and applications' edited by Tiwari et al.

  10. arXiv:2307.07920  [pdf, other]

    cs.SI

    A structural study of Big Tech firm-switching of inventors in the post-recession era

    Authors: Yidan Sun, Mayank Kejriwal

    Abstract: Complex systems research and network science have recently been used to provide novel insights into economic phenomena such as patenting behavior and innovation in firms. Several studies have found that increased mobility of inventors, manifested through firm switching or transitioning, is associated with increased overall productivity. This paper proposes a novel structural study of such transiti…

    Submitted 15 July, 2023; originally announced July 2023.

  11. arXiv:2303.04141  [pdf, other]

    cs.AI

    Toward Defining a Domain Complexity Measure Across Domains

    Authors: Katarina Doctor, Christine Task, Eric Kildebeck, Mayank Kejriwal, Lawrence Holder, Russell Leong

    Abstract: Artificial Intelligence (AI) systems planned for deployment in real-world applications frequently are researched and developed in closed simulation environments where all variables are controlled and known to the simulator or labeled benchmark datasets are used. Transition from these simulators, testbeds, and benchmark datasets to more open-world domains poses significant challenges to AI systems,…

    Submitted 7 March, 2023; originally announced March 2023.

  12. arXiv:2302.09068  [pdf]

    cs.AI cs.CL cs.CV

    A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning

    Authors: Zhisheng Tang, Mayank Kejriwal

    Abstract: We conduct a pilot study selectively evaluating the cognitive abilities (decision making and spatial reasoning) of two recently released generative transformer models, ChatGPT and DALL-E 2. Input prompts were constructed following neutral a priori guidelines, rather than adversarial intent. Post hoc qualitative analysis of the outputs shows that DALL-E 2 is able to generate at least one correct im…

    Submitted 15 February, 2023; originally announced February 2023.

  13. arXiv:2211.13117  [pdf]

    econ.GN cs.SI

    On the Empirical Association between Trade Network Complexity and Global Gross Domestic Product

    Authors: Mayank Kejriwal, Yuesheng Luo

    Abstract: In recent decades, trade between nations has constituted an important component of global Gross Domestic Product (GDP), with official estimates showing that it likely accounted for a quarter of total global production. While evidence of association already exists in macro-economic data between trade volume and GDP growth, there is considerably less work on whether, at the level of individual granu…

    Submitted 18 November, 2022; originally announced November 2022.

    Comments: Peer-reviewed and presented at The 11th International Conference on Complex Networks and their Applications (2022)

  14. arXiv:2210.07519  [pdf, other]

    cs.CL cs.AI

    Can Language Representation Models Think in Bets?

    Authors: Zhisheng Tang, Mayank Kejriwal

    Abstract: In recent years, transformer-based language representation models (LRMs) have achieved state-of-the-art results on difficult natural language understanding problems, such as question answering and text summarization. As these models are integrated into real-world applications, evaluating their ability to make rational decisions is an important research agenda, with practical ramifications. This ar…

    Submitted 14 October, 2022; originally announced October 2022.

  15. arXiv:2210.01263  [pdf, other]

    cs.CL

    Understanding Substructures in Commonsense Relations in ConceptNet

    Authors: Ke Shen, Mayank Kejriwal

    Abstract: Acquiring commonsense knowledge and reasoning is an important goal in modern NLP research. Despite much progress, there is still a lack of understanding (especially at scale) of the nature of commonsense knowledge itself. A potential source of structured commonsense knowledge that could be used to derive insights is ConceptNet. In particular, ConceptNet contains several coarse-grained relations, i…

    Submitted 3 October, 2022; originally announced October 2022.

    Comments: arXiv admin note: substantial text overlap with arXiv:2011.14084

  16. arXiv:2210.01258  [pdf, other]

    cs.CL cs.AI

    Understanding Prior Bias and Choice Paralysis in Transformer-based Language Representation Models through Four Experimental Probes

    Authors: Ke Shen, Mayank Kejriwal

    Abstract: Recent work on transformer-based neural networks has led to impressive advances on multiple-choice natural language understanding (NLU) problems, such as Question Answering (QA) and abductive reasoning. Despite these advances, there is limited work still on understanding whether these models respond to perturbed multiple-choice instances in a sufficiently robust manner that would allow them to be…

    Submitted 3 October, 2022; originally announced October 2022.

  17. arXiv:2204.05872  [pdf, other]

    cs.CY

    Robust Quantification of Gender Disparity in Pre-Modern English Literature using Natural Language Processing

    Authors: Akarsh Nagaraj, Mayank Kejriwal

    Abstract: Research has continued to shed light on the extent and significance of gender disparity in social, cultural and economic spheres. More recently, computational tools from the Natural Language Processing (NLP) literature have been proposed for measuring such disparity using relatively extensive datasets and empirically rigorous methodologies. In this paper, we contribute to this line of research by…

    Submitted 12 April, 2022; originally announced April 2022.

  18. arXiv:2203.12184  [pdf, other]

    cs.CL

    A Theoretically Grounded Benchmark for Evaluating Machine Commonsense

    Authors: Henrique Santos, Ke Shen, Alice M. Mulvehill, Yasaman Razeghi, Deborah L. McGuinness, Mayank Kejriwal

    Abstract: Programming machines with commonsense reasoning (CSR) abilities is a longstanding challenge in the Artificial Intelligence community. Current CSR benchmarks use multiple-choice (and in relatively fewer cases, generative) question-answering instances to evaluate machine commonsense. Recent progress in transformer-based language representation models suggests that considerable progress has been made…

    Submitted 14 July, 2022; v1 submitted 23 March, 2022; originally announced March 2022.

  19. arXiv:2203.06491  [pdf, other]

    cs.SI physics.soc-ph

    Can Scale-free Network Growth with Triad Formation Capture Simplicial Complex Distributions in Real Communication Networks?

    Authors: Mayank Kejriwal, Ke Shen

    Abstract: In recent years, there has been a growing recognition that higher-order structures are important features in real-world networks. A particular class of structures that has gained prominence is known as a simplicial complex. Despite their application to complex processes such as social contagion and novel measures of centrality, not much is currently understood about the distributional properties o…

    Submitted 12 March, 2022; originally announced March 2022.

    Comments: 4 pages, 2 figures

  20. arXiv:2111.05823  [pdf]

    cs.SI cs.CL

    Understanding COVID-19 Vaccine Reaction through Comparative Analysis on Twitter

    Authors: Yuesheng Luo, Mayank Kejriwal

    Abstract: Although multiple COVID-19 vaccines have been available for several months now, vaccine hesitancy continues to be at high levels in the United States. In part, the issue has also become politicized, especially since the presidential election in November. Understanding vaccine hesitancy during this period in the context of social media, including Twitter, can provide valuable guidance both to compu…

    Submitted 10 November, 2021; originally announced November 2021.

    Comments: 20 pages, accepted in the 2022 Computing Conference

  21. arXiv:2108.01699  [pdf, ps, other]

    cs.SI cs.AI cs.CY cs.LG

    Predicting Zip Code-Level Vaccine Hesitancy in US Metropolitan Areas Using Machine Learning Models on Public Tweets

    Authors: Sara Melotte, Mayank Kejriwal

    Abstract: Although the recent rise and uptake of COVID-19 vaccines in the United States has been encouraging, there continues to be significant vaccine hesitancy in various geographic and demographic clusters of the adult population. Surveys, such as the one conducted by Gallup over the past year, can be useful in determining vaccine hesitancy, but can be expensive to conduct and do not provide real-time da…

    Submitted 3 August, 2021; originally announced August 2021.

    Comments: 15 pages, 4 tables, currently under review at PLOS Digital Health

  22. arXiv:2103.00683  [pdf, other]

    cs.LG cs.AI

    Decision Making in Monopoly using a Hybrid Deep Reinforcement Learning Approach

    Authors: Trevor Bonjour, Marina Haliem, Aala Alsalem, Shilpa Thomas, Hongyu Li, Vaneet Aggarwal, Mayank Kejriwal, Bharat Bhargava

    Abstract: Learning to adapt and make real-time informed decisions in a dynamic and complex environment is a challenging problem. Monopoly is a popular strategic board game that requires players to make multiple decisions during the game. Decision-making in Monopoly involves many real-world elements such as strategizing, luck, and modeling of opponent's policies. In this paper, we present novel representatio…

    Submitted 6 April, 2022; v1 submitted 28 February, 2021; originally announced March 2021.

    Comments: accepted in IEEE TETCI

  23. arXiv:2011.14084  [pdf, other]

    cs.AI cs.CL

    A Data-Driven Study of Commonsense Knowledge using the ConceptNet Knowledge Base

    Authors: Ke Shen, Mayank Kejriwal

    Abstract: Acquiring commonsense knowledge and reasoning is recognized as an important frontier in achieving general Artificial Intelligence (AI). Recent research in the Natural Language Processing (NLP) community has demonstrated significant progress in this problem setting. Despite this progress, which is mainly on multiple-choice question answering tasks in limited settings, there is still a lack of under…

    Submitted 19 January, 2021; v1 submitted 28 November, 2020; originally announced November 2020.

  24. arXiv:2011.09159  [pdf, other]

    cs.CL cs.AI

    Do Fine-tuned Commonsense Language Models Really Generalize?

    Authors: Mayank Kejriwal, Ke Shen

    Abstract: Recently, transformer-based methods such as RoBERTa and GPT-3 have led to significant experimental advances in natural language processing tasks such as question answering and commonsense reasoning. The latter is typically evaluated through multiple benchmarks framed as multiple-choice instances of the former. According to influential leaderboards hosted by the Allen Institute (evaluating state-of…

    Submitted 18 November, 2020; originally announced November 2020.

    Comments: 9 pages, 2 figures

    ACM Class: I.2.7

  25. arXiv:2007.15066  [pdf, other]

    cs.CL

    An Experimental Study of The Effects of Position Bias on Emotion Cause Extraction

    Authors: Jiayuan Ding, Mayank Kejriwal

    Abstract: Emotion Cause Extraction (ECE) aims to identify emotion causes from a document after annotating the emotion keywords. Some baselines have been proposed to address this problem, such as rule-based, commonsense based and machine learning methods. We show, however, that a simple random selection approach toward ECE that does not require observing the text achieves similar performance compared to the…

    Submitted 16 July, 2020; originally announced July 2020.

    Comments: 9 pages, 2 figures, 9 tables, bias, position bias, unbalanced labels, deep neural network models

  26. arXiv:2007.13829  [pdf, other]

    cs.IR

    On using Product-Specific Schema.org from Web Data Commons: An Empirical Set of Best Practices

    Authors: Ravi Kiran Selvam, Mayank Kejriwal

    Abstract: Schema.org has experienced high growth in recent years. Structured descriptions of products embedded in HTML pages are now not uncommon, especially on e-commerce websites. The Web Data Commons (WDC) project has extracted schema.org data at scale from webpages in the Common Crawl and made it available as an RDF 'knowledge graph' at scale. The portion of this data that specifically describes product…

    Submitted 27 July, 2020; originally announced July 2020.

    Comments: 8 pages, 3 tables, 6 figures, published in Workshop on Knowledge Graphs and E-Commerce at KDD 2020 (non-archival)

  27. arXiv:1907.06745  [pdf, other]

    cs.CL cs.LG cs.SI

    Low-supervision urgency detection and transfer in short crisis messages

    Authors: Mayank Kejriwal, Peilin Zhou

    Abstract: Humanitarian disasters have been on the rise in recent years due to the effects of climate change and socio-political situations such as the refugee crisis. Technology can be used to best mobilize resources such as food and water in the event of a natural disaster, by semi-automatically flagging tweets and short messages as indicating an urgent need. The problem is challenging not just because of…

    Submitted 15 July, 2019; originally announced July 2019.

    Comments: 8 pages, short version published in ASONAM 2019

  28. arXiv:1801.05906  [pdf, other]

    cs.IR

    Unsupervised Hashtag Retrieval and Visualization for Crisis Informatics

    Authors: Yao Gu, Mayank Kejriwal

    Abstract: In social media like Twitter, hashtags carry a lot of semantic information and can be easily distinguished from the main text. Exploring and visualizing the space of hashtags in a meaningful way can offer important insights into a dataset, especially in crisis situations. In this demonstration paper, we present a functioning prototype, HashViz, that ingests a corpus of tweets collected in the afte…

    Submitted 17 January, 2018; originally announced January 2018.

    Comments: 2 pages, 3 figures, Workshop on Social Web in Emergency and Disaster Management at ACM WSDM 2018

  29. arXiv:1801.05881  [pdf, other]

    cs.IR

    A Pipeline for Post-Crisis Twitter Data Acquisition

    Authors: Mayank Kejriwal, Yao Gu

    Abstract: Due to instant availability of data on social media platforms like Twitter, and advances in machine learning and data management technology, real-time crisis informatics has emerged as a prolific research area in the last decade. Although several benchmarks are now available, especially on portals like CrisisLex, an important, practical problem that has not been addressed thus far is the rapid acq…

    Submitted 17 January, 2018; originally announced January 2018.

    Comments: 6 pages, 4 figures, Workshop on Social Web in Emergency and Disaster Management 2018 at the ACM WSDM Conference

  30. arXiv:1712.03086  [pdf, other]

    cs.CY cs.AI cs.CL

    FlagIt: A System for Minimally Supervised Human Trafficking Indicator Mining

    Authors: Mayank Kejriwal, Jiayuan Ding, Runqi Shao, Anoop Kumar, Pedro Szekely

    Abstract: In this paper, we describe and study the indicator mining problem in the online sex advertising domain. We present an in-development system, FlagIt (Flexible and adaptive generation of Indicators from text), which combines the benefits of both a lightweight expert system and classical semi-supervision (heuristic re-labeling) with recently released state-of-the-art unsupervised text embeddings to t…

    Submitted 5 December, 2017; originally announced December 2017.

    Comments: 6 pages, published in Workshop on Learning with Limited Labeled Data co-held with NIPS 2017

  31. arXiv:1712.00846  [pdf, other]

    cs.AI cs.CY

    Always Lurking: Understanding and Mitigating Bias in Online Human Trafficking Detection

    Authors: Kyle Hundman, Thamme Gowda, Mayank Kejriwal, Benedikt Boecking

    Abstract: Web-based human trafficking activity has increased in recent years but it remains sparsely dispersed among escort advertisements and difficult to identify due to its often-latent nature. The use of intelligent systems to detect trafficking can thus have a direct impact on investigative resource allocation and decision-making, and, more broadly, help curb a widespread social problem. Trafficking de…

    Submitted 3 December, 2017; originally announced December 2017.

    Comments: Submitted to 2018 AAAI 1st conference on AI, Ethics, and Society. Awaiting review

    Journal ref: AAAI/ACM First conference on Artificial Intelligence, Ethics, and Society, New Orleans, USA, February 2018

  32. Predicting Role Relevance with Minimal Domain Expertise in a Financial Domain

    Authors: Mayank Kejriwal

    Abstract: Word embeddings have made enormous inroads in recent years in a wide variety of text mining applications. In this paper, we explore a word embedding-based architecture for predicting the relevance of a role between two financial entities within the context of natural language sentences. In this extended abstract, we propose a pooled approach that uses a collection of sentences to train word embedd…

    Submitted 18 April, 2017; originally announced April 2017.

    Comments: DSMM 2017 workshop at ACM SIGMOD conference

  33. Using Contexts and Constraints for Improved Geotagging of Human Trafficking Webpages

    Authors: Rahul Kapoor, Mayank Kejriwal, Pedro Szekely

    Abstract: Extracting geographical tags from webpages is a well-motivated application in many domains. In illicit domains with unusual language models, like human trafficking, extracting geotags with both high precision and recall is a challenging problem. In this paper, we describe a geotag extraction framework in which context, constraints and the openly available Geonames knowledge base work in tandem in…

    Submitted 18 April, 2017; originally announced April 2017.

    Comments: 6 pages, GeoRich 2017 workshop at ACM SIGMOD conference

  34. Supervised Typing of Big Graphs using Semantic Embeddings

    Authors: Mayank Kejriwal, Pedro Szekely

    Abstract: We propose a supervised algorithm for generating type embeddings in the same semantic vector space as a given set of entity embeddings. The algorithm is agnostic to the derivation of the underlying entity embeddings. It does not require any manual feature engineering, generalizes well to hundreds of types and achieves near-linear scaling on Big Graphs containing many millions of triples and instan…

    Submitted 22 March, 2017; originally announced March 2017.

    Comments: 6 pages, to be published in Semantic Big Data Workshop at ACM, SIGMOD 2017; extended version in preparation for Open Journal of Semantic Web (OJSW)

  35. Information Extraction in Illicit Domains

    Authors: Mayank Kejriwal, Pedro Szekely

    Abstract: Extracting useful entities and attribute values from illicit domains such as human trafficking is a challenging problem with the potential for widespread social impact. Such domains employ atypical language models, have 'long tails' and suffer from the problem of concept drift. In this paper, we propose a lightweight, feature-agnostic Information Extraction (IE) paradigm specifically designed for…

    Submitted 8 March, 2017; originally announced March 2017.

    Comments: 10 pages, ACM WWW 2017

  36. arXiv:1609.06265  [pdf]

    cs.AI cs.DB

    An Ensemble Blocking Scheme for Entity Resolution of Large and Sparse Datasets

    Authors: Janani Balaji, Faizan Javed, Mayank Kejriwal, Chris Min, Sam Sander, Ozgur Ozturk

    Abstract: Entity Resolution, also called record linkage or deduplication, refers to the process of identifying and merging duplicate versions of the same entity into a unified representation. The standard practice is to use a Rule based or Machine Learning based model that compares entity pairs and assigns a score to represent the pairs' Match/Non-Match status. However, performing an exhaustive pair-wise co…

    Submitted 20 September, 2016; v1 submitted 20 September, 2016; originally announced September 2016.

  37. arXiv:1608.04442  [pdf, other]

    cs.DB

    Experience: Type alignment on DBpedia and Freebase

    Authors: Mayank Kejriwal, Daniel P. Miranker

    Abstract: Linked Open Data exhibits growth in both volume and variety of published data. Due to this variety, instances of many different types (e.g. Person) can be found in published datasets. Type alignment is the problem of automatically matching types (in a possibly many-many fashion) between two such datasets. Type alignment is an important preprocessing step in instance matching. Instance matching con…

    Submitted 15 August, 2016; originally announced August 2016.

  38. arXiv:1608.04437  [pdf, other]

    cs.DB

    Self-contained NoSQL Resources for Cross-Domain RDF

    Authors: Mayank Kejriwal, Daniel P. Miranker

    Abstract: Cross-domain knowledge bases such as DBpedia, Freebase and YAGO have emerged as encyclopedic hubs in the Web of Linked Data. Despite enabling several practical applications in the Semantic Web, the large-scale, schema-free nature of such graphs often precludes research groups from employing them widely as evaluation test cases for entity resolution and instance-based ontology alignment application…

    Submitted 15 August, 2016; originally announced August 2016.

  39. arXiv:1605.00686  [pdf, other]

    cs.AI cs.DB

    Adaptive Candidate Generation for Scalable Edge-discovery Tasks on Data Graphs

    Authors: Mayank Kejriwal

    Abstract: Several 'edge-discovery' applications over graph-based data models are known to have worst-case quadratic time complexity in the nodes, even if the discovered edges are sparse. One example is the generic link discovery problem between two graphs, which has invited research interest in several communities. Specific versions of this problem include link prediction in social networks, ontology alignm…

    Submitted 30 June, 2017; v1 submitted 2 May, 2016; originally announced May 2016.

    Comments: 8 pages, published at MLG workshop at KDD'17

  40. arXiv:1501.01696  [pdf, other]

    cs.CC cs.DS

    On the Complexity of Sorted Neighborhood

    Authors: Mayank Kejriwal, Daniel P. Miranker

    Abstract: Record linkage concerns identifying semantically equivalent records in databases. Blocking methods are employed to avoid the cost of full pairwise similarity comparisons on $n$ records. In a seminal work, Hernandez and Stolfo proposed the Sorted Neighborhood blocking method. Several empirical variants have been proposed in recent years. In this paper, we investigate the complexity of the Sorted Ne…

    Submitted 7 January, 2015; originally announced January 2015.

  41. arXiv:1501.01694  [pdf, other]

    cs.DB

    A DNF Blocking Scheme Learner for Heterogeneous Datasets

    Authors: Mayank Kejriwal, Daniel P. Miranker

    Abstract: Entity Resolution concerns identifying co-referent entity pairs across datasets. A typical workflow comprises two steps. In the first step, a blocking method uses a one-many function called a blocking scheme to map entities to blocks. In the second step, entities sharing a block are paired and compared. Current DNF blocking scheme learners (DNF-BSLs) apply only to structurally homogeneous tables.…

    Submitted 7 January, 2015; originally announced January 2015.