-
Empowering Small-Scale Knowledge Graphs: A Strategy of Leveraging General-Purpose Knowledge Graphs for Enriched Embeddings
Authors:
Albert Sawczyn,
Jakub Binkowski,
Piotr Bielak,
Tomasz Kajdanowicz
Abstract:
Knowledge-intensive tasks pose a significant challenge for Machine Learning (ML) techniques. Commonly adopted methods, such as Large Language Models (LLMs), often exhibit limitations when applied to such tasks. Nevertheless, there have been notable endeavours to mitigate these challenges, with a significant emphasis on augmenting LLMs through Knowledge Graphs (KGs). While KGs provide many advantages for representing knowledge, their development costs can deter extensive research and applications. Addressing this limitation, we introduce a framework for enriching embeddings of small-scale domain-specific Knowledge Graphs with well-established general-purpose KGs. Adopting our method, a modest domain-specific KG can benefit from a performance boost in downstream tasks when linked to a substantial general-purpose KG. Experimental evaluations demonstrate a notable enhancement, with up to a 44% increase observed in the Hits@10 metric. This relatively unexplored research direction can catalyze more frequent incorporation of KGs in knowledge-intensive tasks, resulting in more robust, reliable ML implementations that hallucinate less than prevalent LLM solutions.
Keywords: knowledge graph, knowledge graph completion, entity alignment, representation learning, machine learning
Submitted 17 May, 2024;
originally announced May 2024.
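For readers unfamiliar with the Hits@10 metric cited in the abstract above, the following is a minimal sketch of how Hits@k is commonly computed for knowledge graph completion. It assumes each test query has already been assigned the 1-based rank of its correct entity; the function name `hits_at_k` and the example ranks are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def hits_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Return the fraction of queries whose correct entity is ranked in the top k.

    `ranks` holds the 1-based rank of the correct entity for each test query.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Hypothetical ranks for eight test triples (illustrative only).
example_ranks = [1, 3, 12, 7, 50, 10, 2, 101]
score = hits_at_k(example_ranks, k=10)
print(f"Hits@10 = {score:.3f}")
```

Five of the eight hypothetical ranks are at most 10, so the sketch reports 0.625.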
-
Representation learning in multiplex graphs: Where and how to fuse information?
Authors:
Piotr Bielak,
Tomasz Kajdanowicz
Abstract:
In recent years, unsupervised and self-supervised graph representation learning has gained popularity in the research community. However, most proposed methods are focused on homogeneous networks, whereas real-world graphs often contain multiple node and edge types. Multiplex graphs, a special type of heterogeneous graphs, possess richer information, provide better modeling capabilities and integrate more detailed data from potentially different sources. The diverse edge types in multiplex graphs provide more context and insights into the underlying processes of representation learning. In this paper, we tackle the problem of learning representations for nodes in multiplex networks in an unsupervised or self-supervised manner. To that end, we explore diverse information fusion schemes performed at different levels of the graph processing pipeline. The detailed analysis and experimental evaluation of various scenarios inspired us to propose improvements in how to construct GNN architectures that deal with multiplex graphs.
Submitted 27 February, 2024;
originally announced February 2024.
-
Unveiling the Potential of Probabilistic Embeddings in Self-Supervised Learning
Authors:
Denis Janiak,
Jakub Binkowski,
Piotr Bielak,
Tomasz Kajdanowicz
Abstract:
In recent years, self-supervised learning has played a pivotal role in advancing machine learning by allowing models to acquire meaningful representations from unlabeled data. An intriguing research avenue involves developing self-supervised models within an information-theoretic framework, but many studies deviate from the stochasticity assumptions made when deriving their objectives. To gain deeper insights into this issue, we propose to explicitly model the representation with stochastic embeddings and assess their effects on performance, information compression and potential for out-of-distribution detection. From an information-theoretic perspective, we seek to investigate the impact of probabilistic modeling on the information bottleneck, shedding light on a trade-off between compression and preservation of information in both representation and loss space. Emphasizing the importance of distinguishing between these two spaces, we demonstrate how constraining one can affect the other, potentially leading to performance degradation. Moreover, our findings suggest that introducing an additional bottleneck in the loss space can significantly enhance the ability to detect out-of-distribution examples, leveraging only either representation features or the variance of their underlying distribution.
Submitted 27 October, 2023;
originally announced October 2023.
-
Similarity-based Memory Enhanced Joint Entity and Relation Extraction
Authors:
Witold Kosciukiewicz,
Mateusz Wojcik,
Tomasz Kajdanowicz,
Adam Gonczarek
Abstract:
Document-level joint entity and relation extraction is a challenging information extraction problem that requires a unified approach where a single neural network performs four sub-tasks: mention detection, coreference resolution, entity classification, and relation extraction. Existing methods often utilize a sequential multi-task learning approach, in which the arbitrary decomposition causes the current task to depend only on the previous one, missing possible more complex relationships between them. In this paper, we present a multi-task learning framework with bidirectional memory-like dependency between tasks to address those drawbacks and solve the joint problem more accurately. Our empirical studies show that the proposed approach outperforms the existing methods and achieves state-of-the-art results on the BioCreative V CDR corpus.
Submitted 14 July, 2023;
originally announced July 2023.
-
Electoral Agitation Data Set: The Use Case of the Polish Election
Authors:
Mateusz Baran,
Mateusz Wójcik,
Piotr Kolebski,
Michał Bernaczyk,
Krzysztof Rajda,
Łukasz Augustyniak,
Tomasz Kajdanowicz
Abstract:
The popularity of social media makes politicians use it for political advertisement. Therefore, social media is full of electoral agitation (electioneering), especially during election campaigns. The election administration cannot track the spread and quantity of messages that count as agitation under the election code. This work addresses a crucial problem, while also uncovering a niche that has not been effectively targeted so far. Hence, we present the first publicly open data set for detecting electoral agitation in the Polish language. It contains 6,112 human-annotated tweets tagged with four legally conditioned categories. We achieved a 0.66 inter-annotator agreement (Cohen's kappa score). An additional annotator resolved the mismatches between the first two, improving the consistency and complexity of the annotation process. The newly created data set was used to fine-tune a Polish Language Model called HerBERT (achieving a 68% F1 score). We also present a number of potential use cases for such data sets and models, enriching the paper with an analysis of the Polish 2020 Presidential Election on Twitter.
Submitted 13 July, 2023;
originally announced July 2023.
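The inter-annotator agreement reported in the abstract above is Cohen's kappa, defined as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The following is a minimal sketch of that formula, not the authors' annotation pipeline. The two-label scheme ("agit", "none") and the sample annotations are hypothetical; the data set itself uses four categories.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotation lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for ten tweets (illustrative only).
annotator_1 = ["agit", "agit", "none", "none", "agit", "none", "none", "agit", "none", "none"]
annotator_2 = ["agit", "none", "none", "none", "agit", "none", "agit", "agit", "none", "none"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this hypothetical sample the annotators agree on 8 of 10 items (p_o = 0.8) and the chance agreement is p_e = 0.52, giving a kappa of about 0.583.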
-
Domain-Agnostic Neural Architecture for Class Incremental Continual Learning in Document Processing Platform
Authors:
Mateusz Wójcik,
Witold Kościukiewicz,
Mateusz Baran,
Tomasz Kajdanowicz,
Adam Gonczarek
Abstract:
Production deployments in complex systems require ML architectures to be highly efficient and usable against multiple tasks. Particularly demanding are classification problems in which data arrives in a streaming fashion and each class is presented separately. Recent methods with stochastic gradient learning have been shown to struggle in such setups or have limitations, such as memory buffers or restriction to specific domains, that prevent their use in real-world scenarios. For this reason, we present a fully differentiable architecture based on the Mixture of Experts model that enables the training of high-performance classifiers when examples from each class are presented separately. We conducted exhaustive experiments that proved its applicability in various domains and ability to learn online in production environments. The proposed technique achieves SOTA results without a memory buffer and clearly outperforms the reference methods.
Submitted 11 July, 2023;
originally announced July 2023.
-
Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark
Authors:
Łukasz Augustyniak,
Szymon Woźniak,
Marcin Gruza,
Piotr Gramacki,
Krzysztof Rajda,
Mikołaj Morzy,
Tomasz Kajdanowicz
Abstract:
Despite impressive advancements in multilingual corpora collection and model training, developing large-scale deployments of multilingual models still presents a significant challenge. This is particularly true for language tasks that are culture-dependent. One such example is the area of multilingual sentiment analysis, where affective markers can be subtle and deeply ensconced in culture. This work presents the most extensive open massively multilingual corpus of datasets for training sentiment models. The corpus consists of 79 manually selected datasets from over 350 datasets reported in the scientific literature based on strict quality criteria. The corpus covers 27 languages representing 6 language families. Datasets can be queried using several linguistic and functional features. In addition, we present a multi-faceted sentiment classification benchmark summarizing hundreds of experiments conducted on different base models, training objectives, dataset collections, and fine-tuning strategies.
Submitted 13 June, 2023;
originally announced June 2023.
-
Graph-level representations using ensemble-based readout functions
Authors:
Jakub Binkowski,
Albert Sawczyn,
Denis Janiak,
Piotr Bielak,
Tomasz Kajdanowicz
Abstract:
Graph machine learning models have been successfully deployed in a variety of application areas. One of the most prominent types of models - Graph Neural Networks (GNNs) - provides an elegant way of extracting expressive node-level representation vectors, which can be used to solve node-related problems, such as classifying users in a social network. However, many tasks require representations at the level of the whole graph, e.g., molecular applications. In order to convert node-level representations into a graph-level vector, a so-called readout function must be applied. In this work, we study existing readout methods, including simple non-trainable ones, as well as complex, parametrized models. We introduce a concept of ensemble-based readout functions that combine either representations or predictions. Our experiments show that such ensembles allow for better performance than simple single readouts, or performance similar to that of the complex, parametrized ones, but at a fraction of the model complexity.
Submitted 20 April, 2023; v1 submitted 3 March, 2023;
originally announced March 2023.
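To illustrate what a readout function does, the following is a minimal sketch that converts node-level vectors into a single graph-level vector with three simple non-trainable readouts (sum, mean, max) and then combines their outputs by concatenation. Concatenation is only one assumed way of combining representations and is not necessarily the scheme used in the paper; all function names and the toy node embeddings are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Readout = Callable[[Sequence[Vector]], Vector]


def sum_readout(nodes: Sequence[Vector]) -> Vector:
    """Element-wise sum over all node vectors."""
    return [sum(column) for column in zip(*nodes)]


def mean_readout(nodes: Sequence[Vector]) -> Vector:
    """Element-wise mean over all node vectors."""
    return [sum(column) / len(nodes) for column in zip(*nodes)]


def max_readout(nodes: Sequence[Vector]) -> Vector:
    """Element-wise maximum over all node vectors."""
    return [max(column) for column in zip(*nodes)]


def ensemble_readout(nodes: Sequence[Vector], readouts: Sequence[Readout]) -> Vector:
    """Concatenate the graph-level vectors produced by several readouts."""
    if not nodes:
        raise ValueError("graph must contain at least one node")
    graph_vector: Vector = []
    for readout in readouts:
        graph_vector.extend(readout(nodes))
    return graph_vector


# Toy graph with three nodes and 2-dimensional embeddings (illustrative only).
node_embeddings = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
graph_embedding = ensemble_readout(node_embeddings, [sum_readout, mean_readout, max_readout])
print(graph_embedding)
```

Each readout is invariant to the order of the nodes, so the concatenated vector is as well; the resulting graph-level vector would then be passed to a downstream classifier or regressor.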
-
RAFEN -- Regularized Alignment Framework for Embeddings of Nodes
Authors:
Kamil Tagowski,
Piotr Bielak,
Jakub Binkowski,
Tomasz Kajdanowicz
Abstract:
Learning representations of nodes has been a crucial part of graph machine learning research. A well-defined node embedding model should reflect both node features and the graph structure in the final embedding. In the case of dynamic graphs, this problem becomes even more complex as both features and structure may change over time. The embeddings of particular nodes should remain comparable during the evolution of the graph, which can be achieved by applying an alignment procedure. This step was often applied in existing works after the node embedding was already computed. In this paper, we introduce a framework -- RAFEN -- that allows enriching any existing node embedding method with the aforementioned alignment term and learning aligned node embeddings during training. We propose several variants of our framework and demonstrate its performance on six real-world datasets. RAFEN achieves on-par or better performance than existing approaches without requiring additional processing steps.
Submitted 19 April, 2023; v1 submitted 3 March, 2023;
originally announced March 2023.
-
Neural Architecture for Online Ensemble Continual Learning
Authors:
Mateusz Wójcik,
Witold Kościukiewicz,
Tomasz Kajdanowicz,
Adam Gonczarek
Abstract:
Continual learning with an increasing number of classes is a challenging task. The difficulty rises when each example is presented exactly once, which requires the model to learn online. Recent methods with classic parameter optimization procedures have been shown to struggle in such setups or have limitations like non-differentiable components or memory buffers. For this reason, we present a fully differentiable ensemble method that allows us to efficiently train an ensemble of neural networks in an end-to-end regime. The proposed technique achieves SOTA results without a memory buffer and clearly outperforms the reference methods. The conducted experiments have also shown a significant increase in performance for small ensembles, which demonstrates the capability of obtaining relatively high classification accuracy with a reduced number of classifiers.
Submitted 21 August, 2023; v1 submitted 27 November, 2022;
originally announced November 2022.
-
This is the way: designing and compiling LEPISZCZE, a comprehensive NLP benchmark for Polish
Authors:
Łukasz Augustyniak,
Kamil Tagowski,
Albert Sawczyn,
Denis Janiak,
Roman Bartusiak,
Adrian Szymczak,
Marcin Wątroba,
Arkadiusz Janz,
Piotr Szymański,
Mikołaj Morzy,
Tomasz Kajdanowicz,
Maciej Piasecki
Abstract:
The availability of compute and data to train larger and larger language models increases the demand for robust methods of benchmarking the true progress of LM training. Recent years witnessed significant progress in standardized benchmarking for English. Benchmarks such as GLUE, SuperGLUE, or KILT have become de facto standard tools to compare large language models. Following the trend to replicate GLUE for other languages, the KLEJ benchmark has been released for Polish. In this paper, we evaluate the progress in benchmarking for low-resourced languages. We note that only a handful of languages have such comprehensive benchmarks. We also note the gap in the number of tasks being evaluated by benchmarks for resource-rich English/Chinese and the rest of the world. In this paper, we introduce LEPISZCZE (the Polish word for glew, the Middle English predecessor of glue), a new, comprehensive benchmark for Polish NLP with a large variety of tasks and high-quality operationalization of the benchmark. We design LEPISZCZE with flexibility in mind. Including new models, datasets, and tasks is as simple as possible while still offering data versioning and model tracking. In the first run of the benchmark, we test 13 experiments (task and dataset pairs) based on the five most recent LMs for Polish. We use five datasets from the Polish benchmark and add eight novel datasets. As the paper's main contribution, apart from LEPISZCZE, we provide insights and experiences learned while creating the benchmark for Polish as the blueprint to design similar benchmarks for other low-resourced languages.
Submitted 23 November, 2022;
originally announced November 2022.
-
Assessment of Massively Multilingual Sentiment Classifiers
Authors:
Krzysztof Rajda,
Łukasz Augustyniak,
Piotr Gramacki,
Marcin Gruza,
Szymon Woźniak,
Tomasz Kajdanowicz
Abstract:
Models are increasing in size and complexity in the hunt for SOTA. But what if that 2% increase in performance does not make a difference in a production use case? Maybe benefits from a smaller, faster model outweigh those slight performance gains. Also, equally good performance across languages in multilingual tasks is more important than SOTA results on a single one. We present the biggest, unified, multilingual collection of sentiment analysis datasets. We use these to assess 11 models and 80 high-quality sentiment datasets (out of 342 raw datasets collected) in 27 languages and include results on the internally annotated datasets. We deeply evaluate multiple setups, including fine-tuning transformer-based models for measuring performance. We compare results in numerous dimensions, addressing the imbalance in both language coverage and dataset sizes. Finally, we present some best practices for working with such a massive collection of datasets and models from a multilingual perspective.
Submitted 11 April, 2022;
originally announced April 2022.
-
Dynamic pricing and discounts by means of interactive presentation systems in stationary point of sales
Authors:
Marcin Lewicki,
Tomasz Kajdanowicz,
Piotr Bródka,
Janusz Sobecki
Abstract:
The main purpose of this article was to create a model and simulate the profitability conditions of an interactive presentation system (IPS) with the recommender system (RS) used in the kiosk. 90 million simulations have been run in Python with SymPy to address the problem of discount recommendation offered to the clients according to their usage of the IPS.
Submitted 28 January, 2022;
originally announced January 2022.
-
Spatial Data Mining of Public Transport Incidents reported in Social Media
Authors:
Kamil Raczycki,
Marcin Szymański,
Yahor Yeliseyenka,
Piotr Szymański,
Tomasz Kajdanowicz
Abstract:
Public transport agencies use social media as an essential tool for communicating mobility incidents to passengers. However, while the short-term, day-to-day information about transport phenomena is usually posted in social media with low latency, its availability is short term as the content is rarely made available in an aggregated form. Social media communication of transport phenomena usually lacks GIS annotations as most social media platforms do not allow attaching non-POI GPS coordinates to posts. As a result, the analysis of transport phenomena information is minimal. We collected three years of social media posts of a Polish public transport company with user comments. Through exploration, we infer a six-class transport information typology. We successfully build an information type classifier for social media posts, detect stop names in posts, and relate them to GPS coordinates, obtaining a spatial understanding of long-term aggregated phenomena. We show that our approach enables citizen science and use it to analyze the impact of three years of infrastructure incidents on passenger mobility, and the sentiment and reaction scale towards each of the events. All these results are achieved for Polish, an under-resourced language when it comes to spatial language understanding, especially in social media contexts. To improve the situation, we released two of our annotated data sets: social media posts with incident type labels and matched stop names, and social media comments with the annotated sentiment. We also open-source the experimental codebase.
Submitted 11 October, 2021;
originally announced October 2021.
-
Graph Barlow Twins: A self-supervised representation learning framework for graphs
Authors:
Piotr Bielak,
Tomasz Kajdanowicz,
Nitesh V. Chawla
Abstract:
The self-supervised learning (SSL) paradigm is an essential exploration area, which tries to eliminate the need for expensive data labeling. Despite the great success of SSL methods in computer vision and natural language processing, most of them employ contrastive learning objectives that require negative samples, which are hard to define. This becomes even more challenging in the case of graphs and is a bottleneck for achieving robust representations. To overcome such limitations, we propose a framework for self-supervised graph representation learning - Graph Barlow Twins, which utilizes a cross-correlation-based loss function instead of negative samples. Moreover, it does not rely on non-symmetric neural network architectures - in contrast to the state-of-the-art self-supervised graph representation learning method BGRL. We show that our method achieves results competitive with the best self-supervised methods and fully supervised ones while requiring fewer hyperparameters and substantially shorter computation time (ca. 30 times faster than BGRL).
Submitted 12 September, 2023; v1 submitted 4 June, 2021;
originally announced June 2021.
-
AttrE2vec: Unsupervised Attributed Edge Representation Learning
Authors:
Piotr Bielak,
Tomasz Kajdanowicz,
Nitesh V. Chawla
Abstract:
Representation learning has overcome the often arduous and manual featurization of networks through (unsupervised) feature learning as it results in embeddings that can apply to a variety of downstream learning tasks. Representation learning on graphs has focused mainly on shallow (node-centric) or deep (graph-based) learning approaches. While there have been approaches that work on homogeneous and heterogeneous networks with multi-typed nodes and edges, there is a gap in learning edge representations. This paper proposes a novel unsupervised inductive method called AttrE2Vec, which learns a low-dimensional vector representation for edges in attributed networks. It systematically captures the topological proximity, attributes affinity, and feature similarity of edges. Contrary to current advances in edge embedding research, our proposal extends the body of methods providing representations for edges, capturing graph attributes in an inductive and unsupervised manner. Experimental results show that, compared to contemporary approaches, our method builds more powerful edge vector representations, reflected by higher quality measures (AUC, accuracy) in downstream tasks such as edge classification and edge clustering. It is also confirmed by analyzing low-dimensional embedding projections.
Submitted 29 December, 2020;
originally announced December 2020.
-
Political Advertising Dataset: the use case of the Polish 2020 Presidential Elections
Authors:
Łukasz Augustyniak,
Krzysztof Rajda,
Tomasz Kajdanowicz,
Michał Bernaczyk
Abstract:
Political campaigns are full of political ads posted by candidates on social media. Political advertisements constitute a basic form of campaigning, subjected to various social requirements. We present the first publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language. It contains 1,705 human-annotated tweets tagged with nine categories, which constitute campaigning under Polish electoral law. We achieved a 0.65 inter-annotator agreement (Cohen's kappa score). An additional annotator resolved the mismatches between the first two annotators, improving the consistency and complexity of the annotation process. We used the newly created dataset to train a well-established neural tagger (achieving a 70% F1 score). We also present a possible direction of use cases for such datasets and models with an initial analysis of the Polish 2020 Presidential Elections on Twitter.
Submitted 17 June, 2020;
originally announced June 2020.
-
UCSG-Net -- Unsupervised Discovering of Constructive Solid Geometry Tree
Authors:
Kacper Kania,
Maciej Zięba,
Tomasz Kajdanowicz
Abstract:
Signed distance field (SDF) is a prominent implicit representation of 3D meshes. Methods that are based on such representation achieved state-of-the-art 3D shape reconstruction quality. However, these methods struggle to reconstruct non-convex shapes. One remedy is to incorporate a constructive solid geometry framework (CSG) that represents a shape as a decomposition into primitives. It allows embodying a 3D shape of high complexity and non-convexity with a simple tree representation of Boolean operations. Nevertheless, existing approaches are supervised and require the entire CSG parse tree that is given upfront during the training process. On the contrary, we propose a model that extracts a CSG parse tree without any supervision - UCSG-Net. Our model predicts parameters of primitives and binarizes their SDF representation through a differentiable indicator function. It is achieved jointly with discovering the structure of a Boolean operators tree. The model selects dynamically which operator combination over primitives leads to the reconstruction of high fidelity. We evaluate our method on 2D and 3D autoencoding tasks. We show that the predicted parse tree representation is interpretable and can be used in CAD software.
Submitted 20 October, 2020; v1 submitted 16 June, 2020;
originally announced June 2020.
-
Comprehensive Analysis of Aspect Term Extraction Methods using Various Text Embeddings
Authors:
Łukasz Augustyniak,
Tomasz Kajdanowicz,
Przemysław Kazienko
Abstract:
Recently, a variety of model designs and methods have blossomed in the context of the sentiment analysis domain. However, there is still a lack of wide and comprehensive studies of aspect-based sentiment analysis (ABSA). We want to fill this gap and propose a comparison with ablation analysis of aspect term extraction using various text embedding methods. We particularly focused on architectures based on long short-term memory (LSTM) with optional conditional random field (CRF) enhancement using different pre-trained word embeddings. Moreover, we analyzed the influence on the performance of extending the word vectorization step with character embedding. The experimental results on SemEval datasets revealed that not only does bi-directional long short-term memory (BiLSTM) outperform regular LSTM, but also word embedding coverage and its source highly affect aspect detection performance. An additional CRF layer consistently improves the results as well.
Submitted 10 December, 2020; v1 submitted 11 September, 2019;
originally announced September 2019.
-
Extracting Aspects Hierarchies using Rhetorical Structure Theory
Authors:
Łukasz Augustyniak,
Tomasz Kajdanowicz,
Przemysław Kazienko
Abstract:
We propose a novel approach to generate aspect hierarchies that proved to be consistently correct compared with human-generated hierarchies. We present an unsupervised technique using Rhetorical Structure Theory and graph analysis. We evaluated our approach based on 100,000 reviews from Amazon and achieved an astonishing 80% coverage compared with human-generated hierarchies coded in ConceptNet. The method could be easily extended with a sentiment analysis model and used to describe sentiment on different levels of aspect granularity. Hence, besides the flat aspect structure, we can differentiate between aspects and describe if the charging aspect is related to battery or price.
Submitted 4 September, 2019;
originally announced September 2019.
-
Aspect Detection using Word and Char Embeddings with (Bi)LSTM and CRF
Authors:
Łukasz Augustyniak,
Tomasz Kajdanowicz,
Przemysław Kazienko
Abstract:
We propose a new, accurate aspect extraction method that makes use of both word and character-based embeddings. We conducted experiments on various aspect extraction models using LSTM and BiLSTM, including CRF enhancement, on five different pre-trained word embeddings extended with character embeddings. The results revealed not only that BiLSTM outperforms regular LSTM, but also that word embedding coverage in the train and test sets profoundly impacts aspect detection performance. Moreover, the additional CRF layer consistently improves the results across different models and text embeddings. Summing up, we obtained state-of-the-art F-score results for SemEval Restaurants (85%) and Laptops (80%).
Submitted 3 September, 2019;
originally announced September 2019.
-
FILDNE: A Framework for Incremental Learning of Dynamic Networks Embeddings
Authors:
Piotr Bielak,
Kamil Tagowski,
Maciej Falkiewicz,
Tomasz Kajdanowicz,
Nitesh V. Chawla
Abstract:
Representation learning on graphs has emerged as a powerful mechanism to automate feature vector generation for downstream machine learning tasks. The advances in representation learning on graphs have centered on both homogeneous and heterogeneous graphs, where the latter present the challenges associated with multi-typed nodes and/or edges. In this paper, we consider the additional challenge of evolving graphs. We ask whether, and how, the advances in representation learning for static graphs can be leveraged for dynamic graphs. It is important to be able to incorporate those advances to maximize the utility and generalization of methods. To that end, we propose the Framework for Incremental Learning of Dynamic Networks Embedding (FILDNE), which can utilize any existing static representation learning method for learning node embeddings, while keeping the computational costs low. FILDNE integrates the feature vectors computed using the standard methods over different timesteps into a single representation by developing a convex combination function and alignment mechanism. Experimental results on several downstream tasks, over seven real-world data sets, show that FILDNE is able to reduce memory and computational time costs while providing competitive quality measure gains with respect to the contemporary methods for representation learning on dynamic graphs.
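The idea of aligning per-timestep embeddings and merging them with a convex combination can be illustrated with a minimal sketch. It is not the FILDNE implementation: the abstract does not specify the alignment mechanism, so an orthogonal Procrustes rotation is swapped in here as an assumption, every snapshot is assumed to cover the same nodes in the same row order, and the names `align_to_reference`, `combine_snapshots` and `alpha` are hypothetical.

```python
import numpy as np


def align_to_reference(emb, ref):
    """Rotate `emb` onto `ref` (assumed alignment step: orthogonal Procrustes)."""
    # The orthogonal R minimising ||emb @ R - ref||_F is U @ Vt,
    # where U, S, Vt is the SVD of emb.T @ ref.
    u, _, vt = np.linalg.svd(emb.T @ ref)
    return emb @ (u @ vt)


def combine_snapshots(embeddings, alpha=0.5):
    """Fold a list of per-timestep embedding matrices into one matrix.

    `alpha` (hypothetical parameter) is the weight kept by the running
    representation; 1 - alpha goes to the newly aligned snapshot, so every
    update is a convex combination.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    combined = embeddings[0]
    for current in embeddings[1:]:
        aligned = align_to_reference(current, combined)
        combined = alpha * combined + (1.0 - alpha) * aligned
    return combined
```

Because the rotation is estimated before mixing, two snapshots that differ only by an orthogonal transform collapse back to the same representation, which is the reason an alignment step is needed before averaging independently trained embeddings.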
Submitted 19 November, 2020; v1 submitted 6 April, 2019;
originally announced April 2019.
-
LNEMLC: Label Network Embeddings for Multi-Label Classification
Authors:
Piotr Szymański,
Tomasz Kajdanowicz,
Nitesh Chawla
Abstract:
Multi-label classification aims to classify instances with discrete non-exclusive labels. Most approaches on multi-label classification focus on effective adaptation or transformation of existing binary and multi-class learning approaches but fail in modelling the joint probability of labels or do not preserve generalization abilities for unseen label combinations. To address these issues we propose a new multi-label classification scheme, LNEMLC - Label Network Embedding for Multi-Label Classification, that embeds the label network and uses it to extend input space in learning and inference of any base multi-label classifier. The approach allows capturing of labels' joint probability at low computational complexity providing results comparable to the best methods reported in the literature. We demonstrate how the method reveals statistically significant improvements over the simple kNN baseline classifier. We also provide hints for selecting the robust configuration that works satisfactorily across data domains.
Submitted 1 January, 2019; v1 submitted 7 December, 2018;
originally announced December 2018.
-
Graph Energies of Egocentric Networks and Their Correlation with Vertex Centrality Measures
Authors:
Mikołaj Morzy,
Tomasz Kajdanowicz
Abstract:
Graph energy is the energy of the matrix representation of the graph, where the energy of a matrix is the sum of singular values of the matrix. Depending on the definition of a matrix, one can contemplate graph energy, Randić energy, Laplacian energy, distance energy, and many others. Although theoretical properties of various graph energies have been investigated in the past in the areas of mathematics, chemistry, physics, or graph theory, these explorations have been limited to relatively small graphs representing chemical compounds or theoretical graph classes with strictly defined properties. In this paper we investigate the usefulness of the concept of graph energy in the context of large, complex networks. We show that when graph energies are applied to local egocentric networks, the values of these energies correlate strongly with vertex centrality measures. In particular, for some generative network models graph energies tend to correlate strongly with the betweenness and the eigencentrality of vertices. As the exact computation of these centrality measures is expensive and requires global processing of a network, our research opens the possibility of devising efficient algorithms for the estimation of these centrality measures based only on local information.
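The definition above (energy as the sum of singular values of a matrix representation) can be sketched for an egocentric network as follows. This is a minimal illustration under stated assumptions: the graph is undirected and given as a dense adjacency matrix, the ego network has radius 1 (the node, its neighbours, and the edges among them), and the function names are hypothetical.

```python
import numpy as np


def matrix_energy(matrix):
    """Energy of a matrix: the sum of its singular values."""
    values = np.linalg.svd(np.asarray(matrix, dtype=float), compute_uv=False)
    return float(values.sum())


def ego_adjacency(adjacency, node):
    """Adjacency matrix of the radius-1 ego network of `node` (assumed radius)."""
    adjacency = np.asarray(adjacency)
    # Ego network members: the node itself plus its direct neighbours.
    members = np.union1d(np.flatnonzero(adjacency[node]), [node])
    return adjacency[np.ix_(members, members)]


def ego_graph_energy(adjacency, node):
    """Graph energy computed only from the local neighbourhood of `node`."""
    return matrix_energy(ego_adjacency(adjacency, node))
```

Passing a Laplacian or distance matrix of the ego network to `matrix_energy` instead of the adjacency matrix would give the corresponding energy variants named in the abstract.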
Submitted 12 November, 2018; v1 submitted 31 August, 2018;
originally announced September 2018.
-
Priority Attachment: a Comprehensive Mechanism for Generating Networks
Authors:
Mikołaj Morzy,
Tomasz Kajdanowicz,
Przemysław Kazienko,
Grzegorz Miebs,
Arkadiusz Rusin
Abstract:
We claim that networks are created according to the priority attachment mechanism and we show a simple model which uses the priority attachment to generate both synthetic and close to empirical networks. Priority attachment is a mechanism which generalizes previously proposed mechanisms, such as small world creation or preferential attachment, but we also observe its presence in a range of real-world networks. In this paper we show that by using priority attachment we can generate networks of very diverse topologies, as well as recreate empirical networks. An additional advantage of the priority attachment mechanism is an easy interpretation of the latent processes of network formation. We substantiate our claims by performing numerical experiments on synthetic and empirical networks. The two main contributions of the paper are: the introduction of the priority attachment mechanism, and the design of the Priority Rank: a simple network generative model based on the priority attachment mechanism.
Submitted 20 June, 2018; v1 submitted 10 January, 2018;
originally announced January 2018.
-
Method for Aspect-Based Sentiment Annotation Using Rhetorical Analysis
Authors:
Łukasz Augustyniak,
Krzysztof Rajda,
Tomasz Kajdanowicz
Abstract:
This paper fills a gap in aspect-based sentiment analysis and aims to present a new method for preparing and analysing texts concerning opinion and generating user-friendly descriptive reports in natural language. We present a comprehensive set of techniques derived from Rhetorical Structure Theory and sentiment analysis to extract aspects from textual opinions and then build an abstractive summary of a set of opinions. Moreover, we propose aspect-aspect graphs to evaluate the importance of aspects and to filter out unimportant ones from the summary. Additionally, the paper presents a prototype solution of data flow with interesting and valuable results. The proposed method's results proved the high accuracy of aspect detection when applied to the gold standard dataset.
Submitted 13 September, 2017;
originally announced September 2017.
-
Spatio-temporal profiling of public transport delays based on large scale vehicle positioning data from GPS in Wrocław
Authors:
Piotr Szymański,
Michał Żołnieruk,
Piotr Oleszczyk,
Igor Gisterek,
Tomasz Kajdanowicz
Abstract:
In recent years many studies of urban mobility based on large data sets have been published, most of them based on crowdsourced GPS data or smart-card data. We present what is, to our knowledge, the first exploration of public transport delay data harvested from a large-scale, official public transport positioning system, provided by the Wrocław Municipality. We evaluate the characteristics of delays between stops in relation to direction, time and delay variance of 1648 stop pairs from 15 million delay reports. We construct a normalized feature matrix of the likelihood of a given delay change happening at a given hour on the edge between two stops. We then calculate distances between such matrices using earth mover's distance and cluster them using hierarchical agglomerative clustering with Ward's linkage method. We obtain four profiles of delay changes in Wrocław: edges without impact on delay, edges likely to cause delay, edges likely to decrease delay and edges likely to strongly decrease delay (e.g. when a public transport vehicle is speeding). We analyze the spatial and mode-of-transport properties of each cluster and provide insights into the reasons for delay change patterns in each of the detected profiles.
Submitted 25 July, 2017;
originally announced July 2017.
-
A Network Perspective on Stratification of Multi-Label Data
Authors:
Piotr Szymański,
Tomasz Kajdanowicz
Abstract:
In recent years, we have witnessed the development of multi-label classification methods which utilize the structure of the label space in a divide and conquer approach to improve classification performance and allow large data sets to be classified efficiently. Yet most of the available data sets have been provided in train/test splits that did not account for maintaining a distribution of higher-order relationships between labels among splits or folds. We present a new approach to stratifying multi-label data for classification purposes based on the iterative stratification approach proposed by Sechidis et al. in an ECML PKDD 2011 paper. Our method extends the iterative approach to take into account second-order relationships between labels. Obtained results are evaluated using statistical properties of obtained strata as presented by Sechidis. We also propose new statistical measures relevant to second-order quality: label pairs distribution, the percentage of label pairs without positive evidence in folds and label pair - fold pairs that have no positive evidence for the label pair. We verify the impact of new methods on classification performance of Binary Relevance, Label Powerset and a fast greedy community detection based label space partitioning classifier. Random Forests serve as base classifiers. We check the variation of the number of communities obtained per fold, and the stability of their modularity score. Second-Order Iterative Stratification is compared to standard k-fold, label set, and iterative stratification. The proposed approach lowers the variance of classification quality, improves label pair oriented measures and example distribution while maintaining a competitive quality in label-oriented measures. We also witness an increase in stability of network characteristics.
Submitted 27 April, 2017;
originally announced April 2017.
-
Is a Data-Driven Approach still Better than Random Choice with Naive Bayes classifiers?
Authors:
Piotr Szymański,
Tomasz Kajdanowicz
Abstract:
We study the performance of data-driven, a priori and random approaches to label space partitioning for multi-label classification with a Gaussian Naive Bayes classifier. Experiments were performed on 12 benchmark data sets and evaluated on 5 established measures of classification quality: micro and macro averaged F1 score, Subset Accuracy and Hamming loss. Data-driven methods are significantly better than an average run of the random baseline. In the case of F1 scores and Subset Accuracy, data-driven approaches were more likely to perform better than random approaches than otherwise in the worst case. There always exists a method that performs better than a priori methods in the worst case. The advantage of data-driven methods over a priori methods is smaller with a weak classifier than when tree classifiers are used.
Submitted 13 February, 2017;
originally announced February 2017.
-
A scikit-based Python environment for performing multi-label classification
Authors:
Piotr Szymański,
Tomasz Kajdanowicz
Abstract:
scikit-multilearn is a Python library for performing multi-label classification. The library is compatible with the scikit/scipy ecosystem and uses sparse matrices for all internal operations. It provides native Python implementations of popular multi-label classification methods alongside a novel framework for label space partitioning and division. It includes modern algorithm adaptation methods, network-based label space division approaches, which extract label dependency information, and multi-label embedding classifiers. It provides Python-wrapped access to the extensive multi-label method stack from Java libraries and makes it possible to extend deep learning single-label methods for multi-label tasks. The library allows multi-label stratification and data set management. The implementation is more efficient in problem transformation than other established libraries, has good test coverage and follows PEP8. Source code and documentation can be downloaded from http://scikit.ml and also via pip. The library follows the BSD licensing scheme.
Submitted 10 December, 2018; v1 submitted 5 February, 2017;
originally announced February 2017.
-
Balancing Speed and Coverage by Sequential Seeding in Complex Networks
Authors:
Jarosław Jankowski,
Piotr Bródka,
Przemysław Kazienko,
Boleslaw Szymanski,
Radosław Michalski,
Tomasz Kajdanowicz
Abstract:
Information spreading in complex networks is often modeled as diffusing information with a certain probability from nodes that possess it to their neighbors that do not. Information cascades are triggered when the activation of a set of initial nodes (seeds) results in diffusion to a large number of nodes. Here, several novel approaches for seed initiation are introduced that replace the commonly used activation of all seeds at once with a sequence of initiation stages. Sequential strategies at later stages avoid seeding highly ranked nodes that are already activated by diffusion active between stages. The gain arises when a saved seed is allocated to a node that is difficult to reach via diffusion. Sequential seeding and a single stage approach are compared using various seed ranking methods and diffusion parameters on real complex networks. The experimental results indicate that, regardless of the seed ranking method used, sequential seeding strategies deliver better coverage than single stage seeding in about 90% of cases. Longer seeding sequences tend to activate more nodes but they also extend the duration of diffusion. Different variants of sequential seeding resolve the trade-off between the coverage and speed of diffusion differently.
Submitted 12 January, 2017; v1 submitted 23 September, 2016;
originally announced September 2016.
-
WordNet2Vec: Corpora Agnostic Word Vectorization Method
Authors:
Roman Bartusiak,
Łukasz Augustyniak,
Tomasz Kajdanowicz,
Przemysław Kazienko,
Maciej Piasecki
Abstract:
The complex nature of big data resources demands new methods for structuring, especially for textual content. WordNet is a good knowledge source for comprehensive abstraction of natural language, as good implementations of it exist for many languages. Since WordNet embeds natural language in the form of a complex network, a transformation mechanism, WordNet2Vec, is proposed in the paper. It creates a vector for each word from WordNet. These vectors encapsulate the general position (role) of a given word with respect to all other words in the natural language. Any list or set of such vectors contains knowledge about the context of its components within the whole language. Such a word representation can be easily applied to many analytic tasks like classification or clustering. The usefulness of the WordNet2Vec method was demonstrated in sentiment analysis, i.e. classification with transfer learning on a real Amazon opinion textual dataset.
Submitted 10 June, 2016;
originally announced June 2016.
-
How is a data-driven approach better than random choice in label space division for multi-label classification?
Authors:
Piotr Szymański,
Tomasz Kajdanowicz,
Kristian Kersting
Abstract:
We propose using five data-driven community detection approaches from social networks to partition the label space for the task of multi-label classification as an alternative to random partitioning into equal subsets as performed by RAkELd: modularity-maximizing fastgreedy and leading eigenvector, infomap, walktrap and label propagation algorithms. We construct a label co-occurrence graph (in both weighted and unweighted versions) based on training data and perform community detection to partition the label set. We include Binary Relevance and Label Powerset classification methods for comparison. We use Gini-index-based Decision Trees as the base classifier. We compare educated approaches to label space division against random baselines on 12 benchmark data sets over five evaluation measures. We show that in almost all cases seven educated guess approaches are more likely to outperform RAkELd than otherwise in all measures except Hamming Loss. We show that fastgreedy and walktrap community detection methods on weighted label co-occurrence graphs are 85-92% more likely to yield better F1 scores than random partitioning. Infomap on the unweighted label co-occurrence graphs is on average better than random partitioning 90% of the time in terms of Subset Accuracy and 89% of the time in terms of Jaccard similarity. Weighted fastgreedy is better on average than RAkELd when it comes to Hamming Loss.
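The graph construction step can be sketched as follows. This is an illustration only: the abstract does not define the edge weights, so the sketch assumes a weight equal to the number of training samples that carry both labels, takes a dense binary label matrix as input, and uses the hypothetical name `label_cooccurrence`. Community detection on the resulting graph is left out.

```python
import numpy as np


def label_cooccurrence(label_matrix, weighted=True):
    """Build a label co-occurrence adjacency matrix from training labels.

    `label_matrix` is an (n_samples, n_labels) binary array. In the weighted
    version, entry (i, j) is assumed to count the samples tagged with both
    label i and label j; the unweighted version only marks whether the two
    labels ever appear together.
    """
    labels = np.asarray(label_matrix, dtype=int)
    counts = labels.T @ labels          # pairwise co-occurrence counts
    np.fill_diagonal(counts, 0)         # drop self-loops (label frequencies)
    return counts if weighted else (counts > 0).astype(int)
```

The returned matrix can be handed to any community detection routine that accepts an adjacency matrix; each detected community then becomes one label subset for a separate classifier.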
Submitted 7 June, 2016;
originally announced June 2016.
-
Learning in Unlabeled Networks - An Active Learning and Inference Approach
Authors:
Tomasz Kajdanowicz,
Radosław Michalski,
Katarzyna Musiał,
Przemysław Kazienko
Abstract:
The task of determining the labels of all network nodes based on knowledge of the network structure and the labels of some training subset of nodes is called within-network classification. It may happen that none of the node labels is known and, additionally, there is no information about the number of classes to which nodes can be assigned. In such a case a subset of nodes has to be selected for initial label acquisition. The question that arises is: "labels of which nodes should be collected and used for learning in order to provide the best classification accuracy for the whole network?". Active learning and inference is a practical framework to study this problem.
A set of methods for active learning and inference for within-network classification is proposed and validated. Calculating a utility score for each node based on the network structure is the first step in the process. The scores make it possible to rank the nodes. Based on the ranking, a set of nodes for which the labels are acquired is selected (e.g. by taking the top or bottom N from the ranking). The new measure-neighbour methods proposed in the paper suggest not obtaining the labels of nodes from the ranking but rather acquiring the labels of their neighbours. The paper examines 29 distinct formulations of utility score and selection methods, reporting their impact on the results of two collective classification algorithms: Iterative Classification Algorithm and Loopy Belief Propagation.
We argue that the accuracy of the presented methods depends on the structural properties of the examined network. We claim that measure-neighbour methods will work better than the regular methods for networks with a higher clustering coefficient and worse than the regular methods for networks with a low clustering coefficient. According to our hypothesis, based on the clustering coefficient we are able to recommend an appropriate active learning and inference method.
Submitted 5 October, 2015;
originally announced October 2015.
-
Seed Selection for Spread of Influence in Social Networks: Temporal vs. Static Approach
Authors:
Radosław Michalski,
Tomasz Kajdanowicz,
Piotr Bródka,
Przemysław Kazienko
Abstract:
The problem of finding an optimal set of users for influencing others in a social network has been widely studied. Because it is NP-hard, some heuristics were proposed to find sub-optimal solutions. Still, one commonly used assumption is that seeds are chosen on a static network, not a dynamic one. This static approach is in fact far from real-world networks, where new nodes may appear and old ones may disappear over time.
The main purpose of this paper is to analyse how the results of one of the typical models for spread of influence, linear threshold, differ depending on the strategy of building the social network used later for choosing seeds. To show the impact of the network creation strategy on the final number of influenced nodes, i.e. the outcome of the spread of influence, the results for three approaches were studied: one static and two temporal with different granularities, i.e. different numbers of time windows. Social networks for each time window encapsulated dynamic changes in the network structure. Calculation of various node structural measures like degree or betweenness respected these changes by means of a forgetting mechanism: more recent data had greater influence on node measure values. These measures were, in turn, used for ranking nodes and selecting them for seeding.
All concepts were experimentally verified on five real datasets. The results revealed that the temporal approach is always better than the static one, and that the higher the granularity of the temporal social network used for seeding, the more nodes are ultimately influenced. Additionally, the outdegree measure with exponential forgetting typically outperformed other time-dependent structural measures when used for seed candidate ranking.
Submitted 21 November, 2014; v1 submitted 2 May, 2014;
originally announced May 2014.
-
Parallel Processing of Large Graphs
Authors:
Tomasz Kajdanowicz,
Przemyslaw Kazienko,
Wojciech Indyk
Abstract:
More and more large data collections are gathered worldwide in various IT systems. Many of them are networked in nature and need to be processed and analysed as graph structures. Due to their size, they very often require a parallel paradigm for efficient computation. Three parallel techniques are compared in the paper: MapReduce, its map-side join extension and Bulk Synchronous Parallel (BSP). They are implemented for two different graph problems: calculation of single source shortest paths (SSSP) and collective classification of graph nodes by means of relational influence propagation (RIP). The methods and algorithms are applied to several network datasets differing in size and structural profile, originating from three domains: telecommunication, multimedia and microblog. The results revealed that iterative graph processing with the BSP implementation always significantly outperforms MapReduce, by up to 10 times, especially for algorithms with many iterations and sparse communication. The MapReduce extension based on map-side join is also usually noticeably more efficient, although not by as much as BSP. Nevertheless, MapReduce remains a good alternative for enormous networks whose data structures do not fit in local memories.
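To make the BSP formulation of SSSP concrete, the superstep structure can be simulated in a single process. This is a sketch of the general vertex-centric pattern, not the implementation evaluated in the paper: it runs sequentially, assumes non-negative edge weights and a graph given as a dictionary of adjacency lists, and the name `bsp_sssp` is hypothetical.

```python
import math


def bsp_sssp(edges, source):
    """Single source shortest paths as a sequence of BSP supersteps.

    `edges` maps each node to a list of (neighbour, weight) pairs with
    non-negative weights (assumed). Returns the distance map and the number
    of supersteps that were executed.
    """
    nodes = set(edges) | {v for nbrs in edges.values() for v, _ in nbrs}
    distance = {node: math.inf for node in nodes}
    inbox = {source: [0.0]}  # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:
        outbox = {}
        # Compute phase: every vertex with pending messages updates only its
        # own state and emits messages to its neighbours.
        for node, messages in inbox.items():
            best = min(messages)
            if best < distance[node]:
                distance[node] = best
                for neighbour, weight in edges.get(node, []):
                    outbox.setdefault(neighbour, []).append(best + weight)
        # Barrier: messages sent in this superstep become visible in the next.
        inbox = outbox
        supersteps += 1
    return distance, supersteps
```

The computation stops when a superstep produces no messages. Vertex state stays in memory between supersteps and only messages move, whereas an iterative MapReduce formulation passes the graph structure through every iteration; this is the usual explanation for the efficiency gap between the two paradigms on iterative graph algorithms.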
Submitted 3 June, 2013;
originally announced June 2013.
-
Privacy-preserving Data Mining, Sharing and Publishing
Authors:
Katarzyna Pasierb,
Tomasz Kajdanowicz,
Przemyslaw Kazienko
Abstract:
The goal of the paper is to present different approaches to privacy-preserving data sharing and publishing in the context of e-health care systems. In particular, the paper presents a literature review on technical issues in privacy assurance and a current real-life, high-complexity implementation of a medical system that assumes proper data sharing mechanisms.
Submitted 6 April, 2013;
originally announced April 2013.
-
Social Recommendations within the Multimedia Sharing Systems
Authors:
Katarzyna Musial,
Przemyslaw Kazienko,
Tomasz Kajdanowicz
Abstract:
A social recommender system that supports the creation of new relations between users in a multimedia sharing system is presented in the paper. To generate suggestions, a new concept of the multirelational social network is introduced. It covers both direct and object-based relationships, which reflect social and semantic links between users. The main goal of the new method is to create personalized suggestions that are continuously adapted to users' needs depending on the personal weights assigned to each layer of the social network. The conducted experiments confirmed the usefulness of the proposed model.
Submitted 1 March, 2013;
originally announced March 2013.
-
Label-dependent Feature Extraction in Social Networks for Node Classification
Authors:
Tomasz Kajdanowicz,
Przemyslaw Kazienko,
Piotr Doskocz
Abstract:
A new method of feature extraction in social networks for within-network classification is proposed in the paper. The method provides new features calculated by combining network structure information with the class labels assigned to nodes. The influence of various features on classification performance has also been studied. Experiments on real-world data have shown that features created by the proposed method can lead to a significant improvement in classification accuracy.
Submitted 1 March, 2013;
originally announced March 2013.
-
Multidimensional Social Network in the Social Recommender System
Authors:
Przemyslaw Kazienko,
Katarzyna Musial,
Tomasz Kajdanowicz
Abstract:
All online sharing systems gather data that reflects users' collective behaviour and their shared activities. This data can be used to extract different kinds of relationships, which can be grouped into layers and which are the basic components of the multidimensional social network proposed in the paper. The layers are created on the basis of two types of relations between humans, i.e. direct and object-based ones, which correspond to social and semantic links between individuals, respectively. For a better understanding of the complexity of the social network structure, layers and their profiles were identified and studied on two snapshots of the Flickr population taken at different times. Additionally, a separate strength measure was proposed for each layer. The experiments on the Flickr photo sharing system revealed that the relationships between users result either from semantic links between the objects they operate on or from social connections between these users. Moreover, the density of the social network increases over time. The second part of the study is devoted to building a social recommender system that supports the creation of new relations between users in a multimedia sharing system. Its main goal is to generate personalized suggestions that are continuously adapted to users' needs depending on the personal weights assigned to each layer in the multidimensional social network. The conducted experiments confirmed the usefulness of the proposed model.
Submitted 1 March, 2013;
originally announced March 2013.