-
Demonstration of LogicLib: An Expressive Multi-Language Interface over Scalable Datalog System
Authors:
Mingda Li,
Jin Wang,
Guorui Xiao,
Youfu Li,
Carlo Zaniolo
Abstract:
With the ever-increasing volume of data, there is an urgent need to provide expressive and efficient tools to support Big Data analytics. The declarative logical language Datalog has proven very effective at concisely expressing graph, machine learning, and knowledge discovery applications via recursive queries. In this demonstration, we develop Logic Library (LLib), a library of recursive algorithms written in Datalog that can be executed in BigDatalog, a Datalog engine on top of Apache Spark developed by us. LLib encapsulates complex logic-based algorithms into high-level APIs, which simplify development and provide a unified interface akin to that of Spark MLlib. As LLib is fully compatible with DataFrame, it enables the integrated utilization of its built-in applications and new Datalog queries with existing Spark functions, such as those provided by MLlib and Spark SQL. With a variety of examples, we will (i) show how to write programs with LLib to express a variety of applications; (ii) illustrate its user experience in the Apache Spark ecosystem; and (iii) present a user-friendly interface to interact with the LLib framework and monitor the query results.
Submitted 5 September, 2022; v1 submitted 30 May, 2022;
originally announced May 2022.
-
Bio-JOIE: Joint Representation Learning of Biological Knowledge Bases
Authors:
Junheng Hao,
Chelsea Ju,
Muhao Chen,
Yizhou Sun,
Carlo Zaniolo,
Wei Wang
Abstract:
The wide spread of the coronavirus has led to a worldwide pandemic with a high mortality rate. Currently, the knowledge accumulated from different studies about this virus is very limited. Leveraging a wide range of biological knowledge, such as gene ontology and protein-protein interaction (PPI) networks from other closely related species, presents a vital approach to infer the molecular impact of a new species. In this paper, we propose the transferred multi-relational embedding model Bio-JOIE to capture the knowledge of gene ontology and PPI networks, which demonstrates superb capability in modeling the SARS-CoV-2-human protein interactions. Bio-JOIE jointly trains two model components. The knowledge model encodes the relational facts from the protein and GO domains into separate embedding spaces, using a hierarchy-aware encoding technique employed for the GO terms. On top of that, the transfer model learns a non-linear transformation to transfer the knowledge of PPIs and gene ontology annotations across their embedding spaces. By leveraging only structured knowledge, Bio-JOIE significantly outperforms existing state-of-the-art methods in PPI type prediction on multiple species. Furthermore, we also demonstrate the potential of leveraging the learned representations on clustering proteins with enzymatic function into enzyme commission families. Finally, we show that Bio-JOIE can accurately identify PPIs between the SARS-CoV-2 proteins and human proteins, providing valuable insights for advancing research on this new disease.
Submitted 7 March, 2021;
originally announced March 2021.
-
Multilingual Knowledge Graph Completion via Ensemble Knowledge Transfer
Authors:
Xuelu Chen,
Muhao Chen,
Changjun Fan,
Ankith Uppunda,
Yizhou Sun,
Carlo Zaniolo
Abstract:
Predicting missing facts in a knowledge graph (KG) is a crucial task in knowledge base construction and reasoning, and it has been the subject of much research in recent works using KG embeddings. While existing KG embedding approaches mainly learn and predict facts within a single KG, a more plausible solution would benefit from the knowledge in multiple language-specific KGs, considering that different KGs have their own strengths and limitations on data quality and coverage. This is quite challenging, since the transfer of knowledge among multiple independently maintained KGs is often hindered by the insufficiency of alignment information and the inconsistency of described facts. In this paper, we propose KEnS, a novel framework for embedding learning and ensemble knowledge transfer across a number of language-specific KGs. KEnS embeds all KGs in a shared embedding space, where the association of entities is captured based on self-learning. Then, KEnS performs ensemble inference to combine prediction results from embeddings of multiple language-specific KGs, for which multiple ensemble techniques are investigated. Experiments on five real-world language-specific KGs show that KEnS consistently improves state-of-the-art methods on KG completion, via effectively identifying and leveraging complementary knowledge.
Submitted 8 October, 2020; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Monotonic Properties of Completed Aggregates in Recursive Queries
Authors:
Carlo Zaniolo,
Ariyam Das,
Jiaqi Gu,
Youfu Li,
Mingda Li,
Jin Wang
Abstract:
The use of aggregates in recursion enables efficient and scalable support for a wide range of BigData algorithms, including those used in graph applications, KDD applications, and ML applications, which have proven difficult to express and support efficiently in BigData systems supporting Datalog or SQL. The problem with these languages and systems is that, to avoid the semantic and computational issues created by non-monotonic constructs in recursion, they only allow programs that are stratified with respect to negation and aggregates. Now, while this crippling restriction is well-justified for negation, it is frequently unjustified for aggregates, since (i) aggregates are often monotonic in the standard lattice of set-containment, (ii) the PreM property guarantees that programs with extrema in recursion are equivalent to stratified programs where extrema are used as post-constraints, and (iii) any program computing any aggregates on sets of facts of predictable cardinality is tantamount to a stratified program where the precomputation of the cardinality of the set is followed by a stratum where recursive rules only use monotonic constructs. With (i) and (ii) covered in previous papers, this paper focuses on (iii) using examples of great practical interest. For such examples, we provide a formal semantics that is conducive to efficient and scalable implementations via well-known techniques such as semi-naive fixpoint currently supported by most Datalog and SQL3 systems.
Submitted 19 October, 2019;
originally announced October 2019.
-
BigData Applications from Graph Analytics to Machine Learning by Aggregates in Recursion
Authors:
Ariyam Das,
Youfu Li,
Jin Wang,
Mingda Li,
Carlo Zaniolo
Abstract:
In the past, the semantic issues raised by the non-monotonic nature of aggregates often prevented their use in the recursive statements of logic programs and deductive databases. However, the recently introduced notion of Pre-mappability (PreM) has shown that, in key applications of interest, aggregates can be used in recursion to optimize the perfect-model semantics of aggregate-stratified programs. Therefore we can preserve the declarative formal semantics of such programs while achieving a highly efficient operational semantics that is conducive to scalable implementations on parallel and distributed platforms. In this paper, we show that with PreM, a wide spectrum of classical algorithms of practical interest, ranging from graph analytics and dynamic programming based optimization problems to data mining and machine learning applications can be concisely expressed in declarative languages by using aggregates in recursion. Our examples are also used to show that PreM can be checked using simple techniques and templatized verification strategies. A wide range of advanced BigData applications can now be expressed declaratively in logic-based languages, including Datalog, Prolog, and even SQL, while enabling their execution with superior performance and scalability.
Submitted 18 September, 2019;
originally announced September 2019.
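To make the idea of aggregates in recursion concrete, here is a minimal Python sketch of a semi-naive fixpoint for single-source shortest paths in which min is applied inside the recursive step rather than in a final stratum. It is an illustration only, not the authors' code or BigDatalog syntax; the function name `shortest_paths` and the `(src, dst, cost)` edge format are assumptions, and costs are assumed non-negative.

```python
from collections import defaultdict


def shortest_paths(edges, source):
    """Semi-naive fixpoint for shortest path costs with min applied in recursion.

    edges: iterable of (src, dst, cost) tuples (hypothetical format),
           with non-negative costs.
    Returns a dict mapping each reachable node to its minimal cost from source.
    """
    out = defaultdict(list)
    for src, dst, cost in edges:
        out[src].append((dst, cost))

    best = {source: 0}   # minimal cost known so far for each node
    delta = {source: 0}  # facts that improved in the previous iteration

    while delta:
        new_delta = {}
        for node, dist in delta.items():
            for dst, cost in out[node]:
                candidate = dist + cost
                # min in recursion: a derived fact is kept only if it improves
                # the best known cost, so dominated paths are never expanded.
                if candidate < best.get(dst, float("inf")):
                    best[dst] = candidate
                    new_delta[dst] = candidate
        delta = new_delta
    return best


if __name__ == "__main__":
    graph = [("a", "b", 1), ("b", "c", 2), ("a", "c", 5), ("c", "a", 1)]
    print(shortest_paths(graph, "a"))
```

Applying min only after the recursion would not terminate on this cyclic graph, because path costs keep growing around the cycle; pruning dominated facts during the iteration is what makes the fixpoint finite here.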
-
A Case for Stale Synchronous Distributed Model for Declarative Recursive Computation
Authors:
Ariyam Das,
Carlo Zaniolo
Abstract:
A large class of traditional graph and data mining algorithms can be concisely expressed in Datalog, and other Logic-based languages, once aggregates are allowed in recursion. In fact, for most BigData algorithms, the difficult semantic issues raised by the use of non-monotonic aggregates in recursion are solved by Pre-Mappability (PreM), a property that assures that for a program with aggregates in recursion there is an equivalent aggregate-stratified program. In this paper we show that, by bringing together the formal abstract semantics of stratified programs with the efficient operational one of unstratified programs, PreM can also facilitate and improve their parallel execution. We prove that PreM-optimized lock-free and decomposable parallel semi-naive evaluations produce the same results as the single executor programs. Therefore, PreM can be assimilated into the data-parallel computation plans of different distributed systems, irrespective of whether these follow bulk synchronous parallel (BSP) or asynchronous computing models. In addition, we show that non-linear recursive queries can be evaluated using a hybrid stale synchronous parallel (SSP) model on distributed environments. After providing a formal correctness proof for the recursive query evaluation with PreM under this relaxed synchronization model, we present experimental evidence of its benefits. This paper is under consideration for acceptance in Theory and Practice of Logic Programming (TPLP).
Submitted 24 July, 2019;
originally announced July 2019.
-
Quantification and Analysis of Scientific Language Variation Across Research Fields
Authors:
Pei Zhou,
Muhao Chen,
Kai-Wei Chang,
Carlo Zaniolo
Abstract:
Quantifying differences in terminologies from various academic domains has been a longstanding problem yet to be solved. We propose a computational approach for analyzing linguistic variation among scientific research fields by capturing the semantic change of terms based on a neural language model. The model is trained on a large collection of literature in five computer science research fields, for which we obtain field-specific vector representations for key terms, and global vector representations for other words. Several quantitative approaches are introduced to identify the terms whose semantics have drastically changed, or remain unchanged across different research fields. We also propose a metric to quantify the overall linguistic variation of research fields. After quantitative evaluation on human annotated data and qualitative comparison with other methods, we show that our model can improve cross-disciplinary data collaboration by identifying terms that potentially induce confusion during interdisciplinary studies.
Submitted 4 December, 2018;
originally announced December 2018.
-
Embedding Uncertain Knowledge Graphs
Authors:
Xuelu Chen,
Muhao Chen,
Weijia Shi,
Yizhou Sun,
Carlo Zaniolo
Abstract:
Embedding models for deterministic Knowledge Graphs (KG) have been extensively studied, with the purpose of capturing latent semantic relations between entities and incorporating the structured knowledge into machine learning. However, there are many KGs that model uncertain knowledge, typically representing the inherent uncertainty of relation facts with a confidence score, and embedding such uncertain knowledge represents an unresolved challenge. The capturing of uncertain knowledge will benefit many knowledge-driven applications such as question answering and semantic search by providing a more natural characterization of the knowledge. In this paper, we propose a novel uncertain KG embedding model UKGE, which aims to preserve both structural and uncertainty information of relation facts in the embedding space. Unlike previous models that characterize relation facts with binary classification techniques, UKGE learns embeddings according to the confidence scores of uncertain relation facts. To further enhance the precision of UKGE, we also introduce probabilistic soft logic to infer confidence scores for unseen relation facts during training. We propose and evaluate two variants of UKGE based on different learning objectives. Experiments are conducted on three real-world uncertain KGs via three tasks, i.e. confidence prediction, relation fact ranking, and relation fact classification. UKGE shows effectiveness in capturing uncertain knowledge by achieving promising results on these tasks, and consistently outperforms baselines.
Submitted 25 February, 2019; v1 submitted 26 November, 2018;
originally announced November 2018.
-
On2Vec: Embedding-based Relation Prediction for Ontology Population
Authors:
Muhao Chen,
Yingtao Tian,
Xuelu Chen,
Zijun Xue,
Carlo Zaniolo
Abstract:
Populating ontology graphs represents a long-standing problem for the Semantic Web community. Recent advances in translation-based graph embedding methods for populating instance-level knowledge graphs lead to promising new approaches to the ontology population problem. However, unlike instance-level graphs, the majority of relation facts in ontology graphs come with comprehensive semantic relations, which often include the properties of transitivity and symmetry, as well as hierarchical relations. These comprehensive relations are often too complex for existing graph embedding methods, and direct application of such methods is not feasible. Hence, we propose On2Vec, a novel translation-based graph embedding method for ontology population. On2Vec integrates two model components that effectively characterize comprehensive relation facts in ontology graphs. The first is the Component-specific Model that encodes concepts and relations into low-dimensional embedding spaces without a loss of relational properties; the second is the Hierarchy Model that performs focused learning of hierarchical relation facts. Experiments on several well-known ontology graphs demonstrate the promising capabilities of On2Vec in predicting and verifying new relation facts. These promising results also make possible significant improvements in related methods.
Submitted 7 September, 2018;
originally announced September 2018.
-
Learning to Represent Bilingual Dictionaries
Authors:
Muhao Chen,
Yingtao Tian,
Haochen Chen,
Kai-Wei Chang,
Steven Skiena,
Carlo Zaniolo
Abstract:
Bilingual word embeddings have been widely used to capture the similarity of lexical semantics in different human languages. However, many applications, such as cross-lingual semantic search and question answering, can benefit greatly from the cross-lingual correspondence between sentences and lexicons. To bridge this gap, we propose a neural embedding model that leverages bilingual dictionaries. The proposed model is trained to map the literal word definitions to the cross-lingual target words, for which we explore different sentence encoding techniques. To enhance the learning process on limited resources, our model adopts several critical learning strategies, including multi-task learning on different bridges of languages, and joint learning of the dictionary model with a bilingual word embedding model. Experimental evaluation focuses on two applications. The results of the cross-lingual reverse dictionary retrieval task show our model's promising ability to comprehend bilingual concepts based on descriptions, and highlight the effectiveness of the proposed learning strategies in improving performance. Meanwhile, our model effectively addresses the bilingual paraphrase identification problem and significantly outperforms previous approaches.
Submitted 6 September, 2019; v1 submitted 10 August, 2018;
originally announced August 2018.
-
Neural Article Pair Modeling for Wikipedia Sub-article Matching
Authors:
Muhao Chen,
Changping Meng,
Gang Huang,
Carlo Zaniolo
Abstract:
Nowadays, editors tend to separate different subtopics of a long Wikipedia article into multiple sub-articles. This separation seeks to improve human readability. However, it also has a deleterious effect on many Wikipedia-based tasks that rely on the article-as-concept assumption, which requires each entity (or concept) to be described solely by one article. This underlying assumption significantly simplifies knowledge representation and extraction, and it is vital to many existing technologies such as automated knowledge base construction, cross-lingual knowledge alignment, semantic search and data lineage of Wikipedia entities. In this paper we provide an approach to match the scattered sub-articles back to their corresponding main-articles, with the intent of facilitating automated Wikipedia curation and processing. The proposed model adopts a hierarchical learning structure that combines multiple variants of neural document pair encoders with a comprehensive set of explicit features. A large crowdsourced dataset is created to support the evaluation and feature extraction for the task. Based on the large dataset, the proposed model achieves promising results of cross-validation and significantly outperforms previous approaches. Large-scale serving on the entire English Wikipedia also proves the practicability and scalability of the proposed model by effectively extracting a vast collection of newly paired main and sub-articles.
Submitted 4 August, 2018; v1 submitted 31 July, 2018;
originally announced July 2018.
-
Scaling-Up Reasoning and Advanced Analytics on BigData
Authors:
Tyson Condie,
Ariyam Das,
Matteo Interlandi,
Alexander Shkapsky,
Mohan Yang,
Carlo Zaniolo
Abstract:
BigDatalog is an extension of Datalog that achieves performance and scalability on both Apache Spark and multicore systems to the point that its graph analytics outperform those written in GraphX. Looking back, we see how this realizes the ambitious goal pursued by deductive database researchers beginning forty years ago: this is the goal of combining the rigor and power of logic in expressing queries and reasoning with the performance and scalability by which relational databases managed Big Data. This goal led to Datalog which is based on Horn Clauses like Prolog but employs implementation techniques, such as Semi-naive Fixpoint and Magic Sets, that extend the bottom-up computation model of relational systems, and thus obtain the performance and scalability that relational systems had achieved, as far back as the 80s, using data-parallelization on shared-nothing architectures. But this goal proved difficult to achieve because of major issues at (i) the language level and (ii) at the system level. The paper describes how (i) was addressed by simple rules under which the fixpoint semantics extends to programs using count, sum and extrema in recursion, and (ii) was tamed by parallel compilation techniques that achieve scalability on multicore systems and Apache Spark. This paper is under consideration for acceptance in Theory and Practice of Logic Programming (TPLP).
Submitted 9 July, 2018;
originally announced July 2018.
-
Co-training Embeddings of Knowledge Graphs and Entity Descriptions for Cross-lingual Entity Alignment
Authors:
Muhao Chen,
Yingtao Tian,
Kai-Wei Chang,
Steven Skiena,
Carlo Zaniolo
Abstract:
Multilingual knowledge graph (KG) embeddings provide latent semantic representations of entities and structured knowledge with cross-lingual inferences, which benefit various knowledge-driven cross-lingual NLP tasks. However, precisely learning such cross-lingual inferences is usually hindered by the low coverage of entity alignment in many KGs. Since many multilingual KGs also provide literal descriptions of entities, in this paper, we introduce an embedding-based approach which leverages a weakly aligned multilingual KG for semi-supervised cross-lingual learning using entity descriptions. Our approach performs co-training of two embedding models, i.e. a multilingual KG embedding model and a multilingual literal description embedding model. The models are trained on a large Wikipedia-based trilingual dataset where most entity alignment is unknown to training. Experimental results show that the performance of the proposed approach on the entity alignment task improves at each iteration of co-training, and eventually reaches a stage at which it significantly surpasses previous approaches. We also show that our approach has promising abilities for zero-shot entity alignment, and cross-lingual KG completion.
Submitted 17 June, 2018;
originally announced June 2018.
-
How Much Are You Willing to Share? A "Poker-Styled" Selective Privacy Preserving Framework for Recommender Systems
Authors:
Manoj Reddy Dareddy,
Ariyam Das,
Junghoo Cho,
Carlo Zaniolo
Abstract:
Most industrial recommender systems rely on the popular collaborative filtering (CF) technique for providing personalized recommendations to their users. However, the very nature of CF is adversarial to the idea of user privacy, because users need to share their preferences with others in order to be grouped with like-minded people and receive accurate recommendations. While previous privacy preserving approaches have been successful inasmuch as they concealed user preference information to some extent from a centralized recommender system, they have also, nevertheless, incurred significant trade-offs in terms of privacy, scalability, and accuracy. They are also vulnerable to privacy breaches by malicious actors. In light of these observations, we propose a novel selective privacy preserving (SP2) paradigm that allows users to custom-define the scope and extent of their individual privacies, by marking their personal ratings as either public (which can be shared) or private (which are never shared and stored only on the user device). Our SP2 framework works in two steps: (i) first, it builds an initial recommendation model based on the sum of all public ratings that have been shared by users, and (ii) then, this public model is fine-tuned on each user's device based on the user's private ratings, thus eventually learning a more accurate model. Furthermore, in this work, we introduce three different algorithms for implementing an end-to-end SP2 framework that can scale effectively from thousands to hundreds of millions of items. Our user survey shows that an overwhelming fraction of users are likely to rate many more items to improve the overall recommendations when they can control what ratings will be publicly shared with others.
Submitted 3 June, 2018;
originally announced June 2018.
-
Fixpoint Semantics and Optimization of Recursive Datalog Programs with Aggregates
Authors:
Carlo Zaniolo,
Mohan Yang,
Matteo Interlandi,
Ariyam Das,
Alexander Shkapsky,
Tyson Condie
Abstract:
A very desirable Datalog extension investigated by many researchers in the last thirty years consists in allowing the use of the basic SQL aggregates min, max, count and sum in recursive rules. In this paper, we propose a simple comprehensive solution that extends the declarative least-fixpoint semantics of Horn Clauses, along with the optimization techniques used in the bottom-up implementation approach adopted by many Datalog systems. We start by identifying a large class of programs of great practical interest in which the use of min or max in recursive rules does not compromise the declarative fixpoint semantics of the programs using those rules. Then, we revisit the monotonic versions of count and sum aggregates proposed in (Mazuran et al. 2013b) and named, respectively, mcount and msum. Since mcount, and also msum on positive numbers, are monotonic in the lattice of set-containment, they preserve the fixpoint semantics of Horn Clauses. However, in many applications of practical interest, their use can lead to inefficiencies, that can be eliminated by combining them with max, whereby mcount and msum become the standard count and sum. Therefore, the semantics and optimization techniques of Datalog are extended to recursive programs with min, max, count and sum, making possible the advanced applications of superior performance and scalability demonstrated by BigDatalog (Shkapsky et al. 2016) and Datalog-MC (Yang et al. 2017). This paper is under consideration for acceptance in TPLP.
Submitted 21 July, 2017; v1 submitted 18 July, 2017;
originally announced July 2017.
-
Multilingual Knowledge Graph Embeddings for Cross-lingual Knowledge Alignment
Authors:
Muhao Chen,
Yingtao Tian,
Mohan Yang,
Carlo Zaniolo
Abstract:
Many recent works have demonstrated the benefits of knowledge graph embeddings in completing monolingual knowledge graphs. Inasmuch as related knowledge bases are built in several different languages, achieving cross-lingual knowledge alignment will help people in constructing a coherent knowledge base, and assist machines in dealing with different expressions of entity relationships across diverse human languages. Unfortunately, achieving this highly desirable cross-lingual alignment by human labor is very costly and error-prone. Thus, we propose MTransE, a translation-based model for multilingual knowledge graph embeddings, to provide a simple and automated solution. By encoding entities and relations of each language in a separated embedding space, MTransE provides transitions for each embedding vector to its cross-lingual counterparts in other spaces, while preserving the functionalities of monolingual embeddings. We deploy three different techniques to represent cross-lingual transitions, namely axis calibration, translation vectors, and linear transformations, and derive five variants for MTransE using different loss functions. Our models can be trained on partially aligned graphs, where just a small portion of triples are aligned with their cross-lingual counterparts. The experiments on cross-lingual entity matching and triple-wise alignment verification show promising results, with some variants consistently outperforming others on different tasks. We also explore how MTransE preserves the key properties of its monolingual counterpart TransE.
Submitted 17 May, 2017; v1 submitted 11 November, 2016;
originally announced November 2016.
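To make the two ideas named in the abstract concrete, here is a minimal sketch of a TransE-style triple score and of two cross-lingual alignment scores, one using a translation vector and one using a linear transformation. The function names, the entity-only form of the linear variant, and the toy two-dimensional vectors are assumptions made for illustration; they are not taken from the paper's code, and the paper's exact loss functions may differ.

```python
import math

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def transe_score(h, r, t):
    """Monolingual TransE-style score: small when h + r is close to t."""
    return norm(sub(add(h, r), t))

def translation_alignment(triple_a, triple_b, v):
    """Alignment score with one translation vector v between two spaces
    (hypothetical helper): each element of triple_a, shifted by v,
    should land on its counterpart in triple_b."""
    return sum(norm(sub(add(x, v), y)) for x, y in zip(triple_a, triple_b))

def linear_alignment(triple_a, triple_b, m):
    """Alignment score with a linear map m applied to entities only
    (an assumed simplification): m*h should match h', m*t should match t'."""
    (h, _, t), (h2, _, t2) = triple_a, triple_b
    return norm(sub(matvec(m, h), h2)) + norm(sub(matvec(m, t), t2))

# Toy example: language B's space is language A's space shifted by (1, 0).
triple_a = ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])
triple_b = ([2.0, 0.0], [1.0, 1.0], [2.0, 1.0])
shift = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
```

A lower score means a better fit in all three functions; in training, such scores would be minimized over the aligned triples.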
-
Early Accurate Results for Advanced Analytics on MapReduce
Authors:
Nikolay Laptev,
Kai Zeng,
Carlo Zaniolo
Abstract:
Approximate results based on samples often provide the only way in which advanced analytical applications on massive data sets can satisfy their time and resource constraints. Unfortunately, methods and tools for the computation of accurate early results are currently not supported in MapReduce-oriented systems, although these are intended for "big data". Therefore, we proposed and implemented a non-parametric extension of Hadoop which allows the incremental computation of early results for arbitrary workflows, along with reliable online estimates of the degree of accuracy achieved so far in the computation. These estimates are based on a technique called bootstrapping that has been widely employed in statistics and can be applied to arbitrary functions and data distributions. In this paper, we describe our Early Accurate Result Library (EARL) for Hadoop that was designed to minimize the changes required to the MapReduce framework. Various tests of EARL on Hadoop are presented to characterize the frequent situations where EARL can provide major speed-ups over the current version of Hadoop.
Submitted 30 June, 2012;
originally announced July 2012.
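The bootstrapping idea mentioned in the abstract can be sketched on a single machine as follows. This is only an illustration of the general technique, not EARL's implementation: the function names, the doubling schedule for the sample size, and the relative-error stopping rule are all assumptions.

```python
import random
import statistics

def bootstrap_error(sample, estimator, n_resamples=200, rng=None):
    """Return (point estimate, bootstrap standard error) for `estimator`.

    The standard error is the standard deviation of the estimator over
    resamples drawn with replacement from `sample`.
    """
    rng = rng or random.Random(0)
    n = len(sample)
    estimates = []
    for _ in range(n_resamples):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return estimator(sample), statistics.stdev(estimates)

def early_result(data, estimator, target_rel_error=0.05, start=100, rng=None):
    """Grow a random sample until the bootstrap error is small enough.

    Hypothetical driver: doubles the sample size until the estimated
    relative error drops below `target_rel_error` or the data runs out.
    Returns (estimate, standard error, sample size used).
    """
    rng = rng or random.Random(0)
    shuffled = list(data)
    rng.shuffle(shuffled)
    size = min(start, len(shuffled))
    while True:
        point, std_err = bootstrap_error(shuffled[:size], estimator, rng=rng)
        rel = std_err / abs(point) if point else float("inf")
        if rel <= target_rel_error or size == len(shuffled):
            return point, std_err, size
        size = min(2 * size, len(shuffled))

# Example: estimate the mean of 10,000 values from a small sample.
data = [float(i % 100) for i in range(10000)]
point, std_err, used = early_result(data, statistics.mean)
```

Because the estimator is passed in as a function, the same loop works for a median or any other statistic, which is the property the abstract attributes to bootstrapping.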
-
Succinct Sampling on Streams
Authors:
Vladimir Braverman,
Rafail Ostrovsky,
Carlo Zaniolo
Abstract:
A streaming model is one where data items arrive over a long period of time, either one item at a time or in bursts. Typical tasks include computing various statistics over a sliding window of some fixed time horizon. What makes the streaming model interesting is that as time progresses, old items expire and new ones arrive. One of the simplest and most central tasks in this model is sampling, that is, the task of maintaining up to $k$ uniformly distributed items from the current time window as old items expire and new ones arrive. We call sampling algorithms succinct if they use provably optimal (up to constant factors) worst-case memory to maintain $k$ items (either with or without replacement). We stress that in many applications, structures that have an expected succinct representation as time progresses are not sufficient, as small-probability events eventually happen with probability 1. Thus, in this paper we ask the following question: are Succinct Sampling on Streams (or $S^3$) algorithms possible, and if so, for what models? Perhaps somewhat surprisingly, we show that $S^3$-algorithms are possible for all variants of the problem mentioned above, i.e., both with and without replacement and both for one-at-a-time and bursty arrival models. Finally, we use $S^3$-algorithms to solve various problems in the sliding-window model, including frequency moments, counting triangles, and entropy and density estimation. For these problems we present the first solutions with provable worst-case memory guarantees.
Submitted 14 April, 2008; v1 submitted 25 February, 2007;
originally announced February 2007.
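For context, the sketch below shows a well-known priority-based way to keep one uniform sample ($k = 1$) over a time-based sliding window. It is not the paper's $S^3$ algorithm: its memory is small only in expectation, which is exactly the kind of guarantee the abstract argues is insufficient. The class and method names are assumptions for illustration.

```python
import random
from collections import deque

class WindowSampler:
    """Uniform sample of one item from the last `horizon` time units.

    Each arriving item gets a random priority; the sample is the active
    item with the highest priority. An item is discarded as soon as a
    later item has a higher priority, because it can then never be the
    maximum of any future window.
    """

    def __init__(self, horizon, rng=None):
        self.horizon = horizon
        self.rng = rng or random.Random(0)
        # Entries are (timestamp, priority, item); priorities strictly
        # decrease from the front to the back of the deque.
        self.candidates = deque()

    def add(self, timestamp, item):
        priority = self.rng.random()
        while self.candidates and self.candidates[-1][1] <= priority:
            self.candidates.pop()
        self.candidates.append((timestamp, priority, item))
        self._expire(timestamp)

    def _expire(self, now):
        # The window covers timestamps in (now - horizon, now].
        while self.candidates and self.candidates[0][0] <= now - self.horizon:
            self.candidates.popleft()

    def sample(self, now):
        """Return the current sample, or None if the window is empty."""
        self._expire(now)
        return self.candidates[0][2] if self.candidates else None

# Example: one item per time unit, window of 5 time units.
sampler = WindowSampler(horizon=5)
seen = []
for t in range(100):
    sampler.add(t, t)
    seen.append((t, sampler.sample(t)))
```

The number of stored candidates is logarithmic in the window size on average but can, with small probability, grow to the full window, which illustrates the gap between expected and worst-case memory that the abstract highlights.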
-
Greedy Algorithms in Datalog
Authors:
Sergio Greco,
Carlo Zaniolo
Abstract:
In the design of algorithms, the greedy paradigm provides a powerful tool for efficiently solving classical computational problems within the framework of procedural languages. However, expressing these algorithms within the declarative framework of logic-based languages has proven a difficult research challenge. In this paper, we extend the framework of Datalog-like languages to obtain simple declarative formulations for such problems, and propose effective implementation techniques to ensure computational complexities comparable to those of procedural formulations. These advances are achieved through the use of the "choice" construct, extended with preference annotations to effect the selection of alternative stable models and nondeterministic fixpoints. We show that, with suitable storage structures, the differential fixpoint computation of our programs matches the complexity of procedural algorithms in classical search and optimization problems.
Submitted 18 December, 2003;
originally announced December 2003.
-
The Deductive Database System LDL++
Authors:
Faiz Arni,
KayLiang Ong,
Shalom Tsur,
Haixun Wang,
Carlo Zaniolo
Abstract:
This paper describes the LDL++ system and the research advances that have enabled its design and development. We begin by discussing the new nonmonotonic and nondeterministic constructs that extend the functionality of the LDL++ language, while preserving its model-theoretic and fixpoint semantics. Then, we describe the execution model and the open architecture designed to support these new constructs and to facilitate the integration with existing DBMSs and applications. Finally, we describe the lessons learned by using LDL++ on various tested applications, such as middleware and data mining.
Submitted 1 February, 2002;
originally announced February 2002.