Showing 1–40 of 40 results for author: Dong, X L

  1. arXiv:2406.11131

    cs.CL cs.AI cs.DB

    Are Large Language Models a Good Replacement of Taxonomies?

    Authors: Yushi Sun, Hao Xin, Kai Sun, Yifan Ethan Xu, Xiao Yang, Xin Luna Dong, Nan Tang, Lei Chen

    Abstract: Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge while presenting poor performance on long-tail nuanced knowledge, the community is still doubtful about whether the traditional knowledge graphs should be replaced by LLMs. In this paper, we ask…

    Submitted 20 June, 2024; v1 submitted 16 June, 2024; originally announced June 2024.

    Comments: Accepted by VLDB 2024

  2. arXiv:2406.04744

    cs.CL

    CRAG -- Comprehensive RAG Benchmark

    Authors: Xiao Yang, Kai Sun, Hao Xin, Yushi Sun, Nikita Bhalla, Xiangsen Chen, Sajal Choudhary, Rongze Daniel Gui, Ziran Will Jiang, Ziyu Jiang, Lingkun Kong, Brian Moran, Jiaqi Wang, Yifan Ethan Xu, An Yan, Chenyu Yang, Eting Yuan, Hanwen Zha, Nan Tang, Lei Chen, Nicolas Scheffer, Yue Liu, Nirav Shah, Rakesh Wanga, Anuj Kumar, et al. (2 additional authors not shown)

    Abstract: Retrieval-Augmented Generation (RAG) has recently emerged as a promising solution to alleviate Large Language Model (LLM)'s deficiency in lack of knowledge. Existing RAG datasets, however, do not adequately represent the diverse and dynamic nature of real-world Question Answering (QA) tasks. To bridge this gap, we introduce the Comprehensive RAG Benchmark (CRAG), a factual question answering bench…

    Submitted 7 June, 2024; originally announced June 2024.

  3. arXiv:2403.04735

    cs.CV

    SnapNTell: Enhancing Entity-Centric Visual Question Answering with Retrieval Augmented Multimodal LLM

    Authors: Jielin Qiu, Andrea Madotto, Zhaojiang Lin, Paul A. Crook, Yifan Ethan Xu, Xin Luna Dong, Christos Faloutsos, Lei Li, Babak Damavandi, Seungwhan Moon

    Abstract: Vision-extended LLMs have made significant strides in Visual Question Answering (VQA). Despite these advancements, VLLMs still encounter substantial difficulties in handling queries involving long-tail entities, with a tendency to produce erroneous or hallucinated responses. In this work, we introduce a novel evaluative benchmark named SnapNTell, specifically tailored for entity-centric V…

    Submitted 7 March, 2024; originally announced March 2024.

  4. arXiv:2402.10466

    cs.CL cs.AI

    Large Language Models as Zero-shot Dialogue State Tracker through Function Calling

    Authors: Zekun Li, Zhiyu Zoey Chen, Mike Ross, Patrick Huber, Seungwhan Moon, Zhaojiang Lin, Xin Luna Dong, Adithya Sagar, Xifeng Yan, Paul A. Crook

    Abstract: Large language models (LLMs) are increasingly prevalent in conversational systems due to their advanced understanding and generative capabilities in general contexts. However, their effectiveness in task-oriented dialogues (TOD), which requires not only response generation but also effective dialogue state tracking (DST) within specific tasks and domains, remains less satisfying. In this work, we…

    Submitted 30 May, 2024; v1 submitted 16 February, 2024; originally announced February 2024.

    Comments: ACL 2024 Main. Code available at: https://github.com/facebookresearch/FnCTOD

  5. arXiv:2402.08017

    cs.CV cs.CL cs.LG

    Lumos : Empowering Multimodal LLMs with Scene Text Recognition

    Authors: Ashish Shenoy, Yichao Lu, Srihari Jayakumar, Debojeet Chatterjee, Mohsen Moslehpour, Pierce Chuang, Abhay Harpale, Vikas Bhardwaj, Di Xu, Shicong Zhao, Longfang Zhao, Ankit Ramchandani, Xin Luna Dong, Anuj Kumar

    Abstract: We introduce Lumos, the first end-to-end multimodal question-answering system with text understanding capabilities. At the core of Lumos is a Scene Text Recognition (STR) component that extracts text from first person point-of-view images, the output of which is used to augment input to a Multimodal Large Language Model (MM-LLM). While building Lumos, we encountered numerous challenges related to…

    Submitted 1 June, 2024; v1 submitted 12 February, 2024; originally announced February 2024.

    Comments: Accepted to KDD 2024 (ADS Track)

  6. arXiv:2308.14217

    cs.DB cs.AI cs.CL

    Generations of Knowledge Graphs: The Crazy Ideas and the Business Impact

    Authors: Xin Luna Dong

    Abstract: Knowledge Graphs (KGs) have been used to support a wide range of applications, from web search to personal assistant. In this paper, we describe three generations of knowledge graphs: entity-based KGs, which have been supporting general search and question answering (e.g., at Google and Bing); text-rich KGs, which have been supporting search and recommendations for products, bio-informatics, etc.…

    Submitted 27 August, 2023; originally announced August 2023.

    Journal ref: PVLDB 2023

  7. arXiv:2308.10168

    cs.CL

    Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs?

    Authors: Kai Sun, Yifan Ethan Xu, Hanwen Zha, Yue Liu, Xin Luna Dong

    Abstract: Since the recent prosperity of Large Language Models (LLMs), there have been interleaved discussions regarding how to reduce hallucinations from LLM responses, how to increase the factuality of LLMs, and whether Knowledge Graphs (KGs), which store the world knowledge in a symbolic form, will be replaced with LLMs. In this paper, we try to answer these questions from a new angle: How knowledgeable…

    Submitted 2 April, 2024; v1 submitted 20 August, 2023; originally announced August 2023.

    Comments: To appear in NAACL 2024

  8. OA-Mine: Open-World Attribute Mining for E-Commerce Products with Weak Supervision

    Authors: Xinyang Zhang, Chenwei Zhang, Xian Li, Xin Luna Dong, Jingbo Shang, Christos Faloutsos, Jiawei Han

    Abstract: Automatic extraction of product attributes from their textual descriptions is essential for online shopper experience. One inherent challenge of this task is the emerging nature of e-commerce products -- we see new types of products with their unique set of new attributes constantly. Most prior works on this matter mine new values for a set of known attributes but cannot handle new attributes that…

    Submitted 29 April, 2022; originally announced April 2022.

    Comments: WWW 2022

  9. arXiv:2202.09747

    cs.SI

    PGE: Robust Product Graph Embedding Learning for Error Detection

    Authors: Kewei Cheng, Xian Li, Yifan Ethan Xu, Xin Luna Dong, Yizhou Sun

    Abstract: Although product graphs (PGs) have gained increasing attentions in recent years for their successful applications in product search and recommendations, the extensive power of PGs can be limited by the inevitable involvement of various kinds of errors. Thus, it is critical to validate the correctness of triples in PGs to improve their reliability. Knowledge graph (KG) embedding methods have strong…

    Submitted 20 February, 2022; originally announced February 2022.

  10. arXiv:2202.08069

    cs.DB cs.CY cs.HC

    VLDB 2021: Designing a Hybrid Conference

    Authors: Pınar Tözün, Felix Naumann, Philippe Bonnet, Xin Luna Dong

    Abstract: In 2020, while main database conferences one by one had to adopt a virtual format as a result of the ongoing COVID-19 pandemic, we decided to hold VLDB 2021 in hybrid format. This paper describes how we defined the hybrid format for VLDB 2021 going through the key design decisions. In addition, we list the lessons learned from running such a conference. Our goal is to share this knowledge with fel…

    Submitted 26 January, 2022; originally announced February 2022.

  11. arXiv:2110.14509

    cs.LG cs.DB

    Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation

    Authors: Di Jin, Bunyamin Sisman, Hao Wei, Xin Luna Dong, Danai Koutra

    Abstract: Multi-source entity linkage focuses on integrating knowledge from multiple sources by linking the records that represent the same real world entity. This is critical in high-impact applications such as data cleaning and user stitching. The state-of-the-art entity linkage pipelines mainly depend on supervised learning that requires abundant amounts of training data. However, collecting well-labeled…

    Submitted 27 October, 2021; originally announced October 2021.

  12. arXiv:2109.05460

    cs.CL cs.AI

    End-to-End Conversational Search for Online Shopping with Utterance Transfer

    Authors: Liqiang Xiao, Jun Ma, Xin Luna Dong, Pascual Martinez-Gomez, Nasser Zalmout, Wei Chen, Tong Zhao, Hao He, Yaohui Jin

    Abstract: Successful conversational search systems can present natural, adaptive and interactive shopping experience for online shopping customers. However, building such systems from scratch faces real-world challenges from both imperfect product schema/knowledge and lack of training dialog data. In this work we first propose ConvSearch, an end-to-end conversational search system that deeply combines the dia…

    Submitted 12 September, 2021; originally announced September 2021.

  13. arXiv:2106.04630

    cs.CV cs.CL cs.LG

    PAM: Understanding Product Images in Cross Product Category Attribute Extraction

    Authors: Rongmei Lin, Xiang He, Jie Feng, Nasser Zalmout, Yan Liang, Li Xiong, Xin Luna Dong

    Abstract: Understanding product attributes plays an important role in improving online shopping experience for customers and serves as an integral part for constructing a product knowledge graph. Most existing methods focus on attribute extraction from text description or utilize visual information from product images such as shape and color. Compared to the inputs considered in prior works, a product image…

    Submitted 8 June, 2021; originally announced June 2021.

    Comments: KDD 2021

  14. arXiv:2106.02318

    cs.CL

    AdaTag: Multi-Attribute Value Extraction from Product Profiles with Adaptive Decoding

    Authors: Jun Yan, Nasser Zalmout, Yan Liang, Christan Grant, Xiang Ren, Xin Luna Dong

    Abstract: Automatic extraction of product attribute values is an important enabling technology in e-Commerce platforms. This task is usually modeled using sequence labeling architectures, with several extensions to handle multi-attribute extraction. One line of previous work constructs attribute-specific models, through separate decoders or entirely separate models. However, this approach constrains knowled…

    Submitted 4 June, 2021; originally announced June 2021.

    Comments: Accepted to ACL-IJCNLP 2021

  15. arXiv:2106.00793

    cs.CL

    CoRI: Collective Relation Integration with Data Augmentation for Open Information Extraction

    Authors: Zhengbao Jiang, Jialong Han, Bunyamin Sisman, Xin Luna Dong

    Abstract: Integrating extracted knowledge from the Web to knowledge graphs (KGs) can facilitate tasks like question answering. We study relation integration that aims to align free-text relations in subject-relation-object extractions to relations in a target KG. To address the challenge that free-text relations are ambiguous, previous methods exploit neighbor entities and relations for additional context.…

    Submitted 1 June, 2021; originally announced June 2021.

    Comments: ACL 2021

  16. TCN: Table Convolutional Network for Web Table Interpretation

    Authors: Daheng Wang, Prashant Shiralkar, Colin Lockard, Binxuan Huang, Xin Luna Dong, Meng Jiang

    Abstract: Information extraction from semi-structured webpages provides valuable long-tailed facts for augmenting knowledge graph. Relational Web tables are a critical component containing additional entities and attributes of rich and diverse knowledge. However, extracting knowledge from relational tables is challenging because of sparse contextual information. Existing work linearize table cells and heavi…

    Submitted 16 February, 2021; originally announced February 2021.

  17. arXiv:2011.05928

    cs.IR cs.AI

    J-Recs: Principled and Scalable Recommendation Justification

    Authors: Namyong Park, Andrey Kan, Christos Faloutsos, Xin Luna Dong

    Abstract: Online recommendation is an essential functionality across a variety of services, including e-commerce and video streaming, where items to buy, watch, or read are suggested to users. Justifying recommendations, i.e., explaining why a user might like the recommended item, has been shown to improve user satisfaction and persuasiveness of the recommendation. In this paper, we develop a method for gen…

    Submitted 11 November, 2020; originally announced November 2020.

    Comments: ICDM 2020

  18. arXiv:2009.07203

    cs.DB cs.LG

    CorDEL: A Contrastive Deep Learning Approach for Entity Linkage

    Authors: Zhengyang Wang, Bunyamin Sisman, Hao Wei, Xin Luna Dong, Shuiwang Ji

    Abstract: Entity linkage (EL) is a critical problem in data cleaning and integration. In the past several decades, EL has typically been done by rule-based systems or traditional machine learning models with hand-curated features, both of which heavily depend on manual human inputs. With the ever-increasing growth of new data, deep learning (DL) based approaches have been proposed to alleviate the high cost…

    Submitted 2 December, 2020; v1 submitted 15 September, 2020; originally announced September 2020.

    Comments: Accepted by the 20th IEEE International Conference on Data Mining (ICDM 2020)

  19. AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types

    Authors: Xin Luna Dong, Xiang He, Andrey Kan, Xian Li, Yan Liang, Jun Ma, Yifan Ethan Xu, Chenwei Zhang, Tong Zhao, Gabriel Blanco Saldana, Saurabh Deshpande, Alexandre Michetti Manduca, Jay Ren, Surender Pal Singh, Fan Xiao, Haw-Shiuan Chang, Giannis Karamanolakis, Yuning Mao, Yaqing Wang, Christos Faloutsos, Andrew McCallum, Jiawei Han

    Abstract: Can one build a knowledge graph (KG) for all products in the world? Knowledge graphs have firmly established themselves as valuable sources of information for search and question answering, and it is natural to wonder if a KG can contain information about products offered at online retail sites. There have been several successful examples of generic KGs, but organizing information about products p…

    Submitted 24 June, 2020; originally announced June 2020.

    Comments: KDD 2020

  20. MultiImport: Inferring Node Importance in a Knowledge Graph from Multiple Input Signals

    Authors: Namyong Park, Andrey Kan, Xin Luna Dong, Tong Zhao, Christos Faloutsos

    Abstract: Given multiple input signals, how can we infer node importance in a knowledge graph (KG)? Node importance estimation is a crucial and challenging task that can benefit a lot of applications including recommendation, search, and query disambiguation. A key challenge towards this goal is how to effectively use input from different sources. On the one hand, a KG is a rich source of information, with…

    Submitted 22 June, 2020; originally announced June 2020.

    Comments: KDD 2020 Research Track. 10 pages

  21. Octet: Online Catalog Taxonomy Enrichment with Self-Supervision

    Authors: Yuning Mao, Tong Zhao, Andrey Kan, Chenwei Zhang, Xin Luna Dong, Christos Faloutsos, Jiawei Han

    Abstract: Taxonomies have found wide applications in various domains, especially online for item categorization, browsing, and search. Despite the prevalent use of online catalog taxonomies, most of them in practice are maintained by humans, which is labor-intensive and difficult to scale. While taxonomy construction from scratch is considerably studied in the literature, how to effectively enrich existing…

    Submitted 18 June, 2020; originally announced June 2020.

    Comments: KDD 2020

  22. arXiv:2006.08779

    cs.CL cs.LG

    Automatic Validation of Textual Attribute Values in E-commerce Catalog by Learning with Limited Labeled Data

    Authors: Yaqing Wang, Yifan Ethan Xu, Xian Li, Xin Luna Dong, Jing Gao

    Abstract: Product catalogs are valuable resources for eCommerce website. In the catalog, a product is associated with multiple attributes whose values are short texts, such as product name, brand, functionality and flavor. Usually individual retailers self-report these key values, and thus the catalog information unavoidably contains noisy facts. Although existing deep neural network models have shown succe…

    Submitted 22 June, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

    Comments: KDD 2020

  23. arXiv:2005.07105

    cs.CL cs.IR

    ZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages

    Authors: Colin Lockard, Prashant Shiralkar, Xin Luna Dong, Hannaneh Hajishirzi

    Abstract: In many documents, such as semi-structured webpages, textual semantics are augmented with additional information conveyed using visual elements including layout, font size, and color. Prior work on information extraction from semi-structured websites has required learning an extraction model specific to a given template via either manually labeled or distantly supervised data from that template. I…

    Submitted 14 May, 2020; originally announced May 2020.

    Comments: Accepted to ACL 2020

  24. arXiv:2004.13852

    cs.CL cs.IR cs.LG stat.ML

    TXtract: Taxonomy-Aware Knowledge Extraction for Thousands of Product Categories

    Authors: Giannis Karamanolakis, Jun Ma, Xin Luna Dong

    Abstract: Extracting structured knowledge from product profiles is crucial for various applications in e-Commerce. State-of-the-art approaches for knowledge extraction were each designed for a single category of product, and thus do not apply to real-life e-Commerce scenarios, which often contain thousands of diverse categories. This paper proposes TXtract, a taxonomy-aware knowledge extraction model that a…

    Submitted 1 May, 2020; v1 submitted 14 April, 2020; originally announced April 2020.

    Comments: Accepted to ACL 2020 (Long Paper)

  25. AutoBlock: A Hands-off Blocking Framework for Entity Matching

    Authors: Wei Zhang, Hao Wei, Bunyamin Sisman, Xin Luna Dong, Christos Faloutsos, David Page

    Abstract: Entity matching seeks to identify data records over one or multiple data sources that refer to the same real-world entity. Virtually every entity matching task on large datasets requires blocking, a step that reduces the number of record pairs to be matched. However, most of the traditional blocking methods are learning-free and key-based, and their successes are largely built on laborious human e…

    Submitted 6 December, 2019; originally announced December 2019.

    Comments: In The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM '20), February 3-7, 2020, Houston, TX, USA. ACM, Anchorage, Alaska, USA, 9 pages

  26. arXiv:1907.09657

    cs.DB

    Efficient Knowledge Graph Accuracy Evaluation

    Authors: Junyang Gao, Xian Li, Yifan Ethan Xu, Bunyamin Sisman, Xin Luna Dong, Jun Yang

    Abstract: Estimation of the accuracy of a large-scale knowledge graph (KG) often requires humans to annotate samples from the graph. How to obtain statistically meaningful estimates for accuracy evaluation while keeping human annotation costs low is a problem critical to the development cycle of a KG and its practical applications. Surprisingly, this challenging problem has largely been ignored in prior res…

    Submitted 22 July, 2019; originally announced July 2019.

    Comments: in VLDB 2019

  27. arXiv:1905.08865

    cs.LG cs.IR stat.ML

    Estimating Node Importance in Knowledge Graphs Using Graph Neural Networks

    Authors: Namyong Park, Andrey Kan, Xin Luna Dong, Tong Zhao, Christos Faloutsos

    Abstract: How can we estimate the importance of nodes in a knowledge graph (KG)? A KG is a multi-relational graph that has proven valuable for many tasks including question answering and semantic search. In this paper, we present GENI, a method for tackling the problem of estimating node importance in KGs, which enables several downstream applications such as item recommendation and resource allocation. Whi…

    Submitted 16 June, 2019; v1 submitted 21 May, 2019; originally announced May 2019.

    Comments: KDD 2019 Research Track. 11 pages. Changelog: Type 3 font removed, and minor updates made in the Appendix (v2)

  28. arXiv:1904.12606

    cs.IR cs.LG stat.ML

    OpenKI: Integrating Open Information Extraction and Knowledge Bases with Relation Inference

    Authors: Dongxu Zhang, Subhabrata Mukherjee, Colin Lockard, Xin Luna Dong, Andrew McCallum

    Abstract: In this paper, we consider advancing web-scale knowledge extraction and alignment by integrating OpenIE extractions in the form of (subject, predicate, object) triples with Knowledge Bases (KB). Traditional techniques from universal schema and from schema mapping fall in two extremes: either they perform instance-level inference relying on embedding for (subject, object) pairs, thus cannot handle…

    Submitted 12 April, 2019; originally announced April 2019.

  29. arXiv:1807.08447

    cs.LG cs.AI cs.CL stat.ML

    LinkNBed: Multi-Graph Representation Learning with Entity Linkage

    Authors: Rakshit Trivedi, Bunyamin Sisman, Jun Ma, Christos Faloutsos, Hongyuan Zha, Xin Luna Dong

    Abstract: Knowledge graphs have emerged as an important model for studying complex multi-relational data. This has given rise to the construction of numerous large scale but incomplete knowledge graphs encoding information extracted from various resources. An effective and scalable approach to jointly learn over multiple graphs and eventually construct a unified graph is a crucial next step for the success…

    Submitted 23 July, 2018; originally announced July 2018.

    Comments: ACL 2018

  30. arXiv:1806.01264

    cs.CL cs.AI cs.IR stat.ML

    OpenTag: Open Attribute Value Extraction from Product Profiles [Deep Learning, Active Learning, Named Entity Recognition]

    Authors: Guineng Zheng, Subhabrata Mukherjee, Xin Luna Dong, Feifei Li

    Abstract: Extraction of missing attribute values is to find values describing an attribute of interest from a free text input. Most past related work on extraction of missing attribute values work with a closed world assumption with the possible set of values known beforehand, or use dictionaries of values and hand-crafted features. How can we discover new attribute values that we have never seen before? Ca…

    Submitted 6 October, 2018; v1 submitted 1 June, 2018; originally announced June 2018.

    Comments: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 19-23, 2018

  31. arXiv:1804.04635

    cs.AI cs.IR

    CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

    Authors: Colin Lockard, Xin Luna Dong, Arash Einolghozati, Prashant Shiralkar

    Abstract: The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically-generated lab…

    Submitted 12 April, 2018; originally announced April 2018.

    Comments: Expanded version of paper under review for VLDB

  32. arXiv:1705.04915

    cs.DB

    Discovering Multiple Truths with a Hybrid Model

    Authors: Furong Li, Xin Luna Dong, Anno Langen, Yang Li

    Abstract: Many data management applications require integrating information from multiple sources. The sources may not be accurate and provide erroneous values. We thus have to identify the true values from conflicting observations made by the sources. The problem is further complicated when there may exist multiple truths (e.g., a book written by several authors). In this paper we propose a model called Hy…

    Submitted 14 May, 2017; originally announced May 2017.

  33. arXiv:1503.00604

    cs.DB

    Robust Group Linkage

    Authors: Pei Li, Xin Luna Dong, Songtao Guo, Andrea Maurino, Divesh Srivastava

    Abstract: We study the problem of group linkage: linking records that refer to entities in the same group. Applications for group linkage include finding businesses in the same chain, finding conference attendees from the same affiliation, finding players from the same team, etc. Group linkage faces challenges not present for traditional record linkage. First, although different members in the same group ca…

    Submitted 2 March, 2015; originally announced March 2015.

  34. arXiv:1503.00310

    cs.DB

    Data Fusion: Resolving Conflicts from Multiple Sources

    Authors: Xin Luna Dong, Laure Berti-Equille, Divesh Srivastava

    Abstract: Many data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, require integrating data from multiple sources. Each of these sources provides a set of values and different sources can often provide conflicting values. To present quality data to users, it is critical to resolve conflicts and discover values that ref…

    Submitted 1 March, 2015; originally announced March 2015.

    Comments: WAIM 2013

  35. arXiv:1503.00309

    cs.DB

    Scaling up Copy Detection

    Authors: Xian Li, Xin Luna Dong, Kenneth B. Lyons, Weiyi Meng, Divesh Srivastava

    Abstract: Recent research shows that copying is prevalent for Deep-Web data and considering copying can significantly improve truth finding from conflicting values. However, existing copy detection techniques do not scale for large sizes and numbers of data sources, so truth finding can be slowed down by one to two orders of magnitude compared with the corresponding techniques that do not consider copying.…

    Submitted 1 March, 2015; originally announced March 2015.

    Comments: ICDE 2015

  36. arXiv:1503.00306

    cs.DB

    Fusing Data with Correlations

    Authors: Ravali Pochampally, Anish Das Sarma, Xin Luna Dong, Alexandra Meliou, Divesh Srivastava

    Abstract: Many applications rely on Web data and extraction systems to accomplish knowledge-driven tasks. Web information is not curated, so many sources provide inaccurate, or conflicting information. Moreover, extraction systems introduce additional noise to the data. We wish to automatically distinguish correct data and erroneous data for creating a cleaner set of integrated data. Previous work has shown…

    Submitted 1 March, 2015; originally announced March 2015.

    Comments: Sigmod'2014

  37. arXiv:1503.00303

    cs.DB cs.IR

    Truth Finding on the Deep Web: Is the Problem Solved?

    Authors: Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, Divesh Srivastava

    Abstract: The amount of useful information available on the Web has been growing at a dramatic pace in recent years and people rely more and more on the Web to fulfill their information needs. In this paper, we study truthfulness of Deep Web data in two domains where we believed data are fairly clean and data quality is important to people's lives: Stock and Flight. To our surprise, we observed…

    Submitted 1 March, 2015; originally announced March 2015.

    Comments: VLDB'2013

  38. arXiv:1503.00302

    cs.DB

    From Data Fusion to Knowledge Fusion

    Authors: Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Kevin Murphy, Shaohua Sun, Wei Zhang

    Abstract: The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey [LDL+12] has provided a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitat…

    Submitted 1 March, 2015; originally announced March 2015.

    Comments: VLDB'2014

  39. TimeMachine: Timeline Generation for Knowledge-Base Entities

    Authors: Tim Althoff, Xin Luna Dong, Kevin Murphy, Safa Alai, Van Dang, Wei Zhang

    Abstract: We present a method called TIMEMACHINE to generate a timeline of events and relations for entities in a knowledge base. For example for an actor, such a timeline should show the most important professional and personal milestones and relationships such as works, awards, collaborations, and family relationships. We develop three orthogonal timeline quality criteria that an ideal timeline should sat…

    Submitted 8 June, 2015; v1 submitted 16 February, 2015; originally announced February 2015.

    Comments: To appear at ACM SIGKDD KDD'15. 12pp, 7 fig. With appendix. Demo and other info available at http://cs.stanford.edu/~althoff/timemachine/

    ACM Class: H.2.8

  40. arXiv:1502.03519

    cs.DB cs.IR

    Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources

    Authors: Xin Luna Dong, Evgeniy Gabrilovich, Kevin Murphy, Van Dang, Wilko Horn, Camillo Lugaresi, Shaohua Sun, Wei Zhang

    Abstract: The quality of web sources has been traditionally evaluated using exogenous signals such as the hyperlink structure of the graph. We propose a new approach that relies on endogenous signals, namely, the correctness of factual information provided by the source. A source that has few false facts is considered to be trustworthy. The facts are automatically extracted from each source by information e…

    Submitted 11 February, 2015; originally announced February 2015.