-
Reinforcement Learning with Dynamic Multi-Reward Weighting for Multi-Style Controllable Generation
Authors:
Karin de Langis,
Ryan Koo,
Dongyeop Kang
Abstract:
Style is an integral component of text that expresses a diverse set of information, including interpersonal dynamics (e.g. formality) and the author's emotions or attitudes (e.g. disgust). Humans often employ multiple styles simultaneously. An open question is how large language models can be explicitly controlled so that they weave together target styles when generating text: for example, to prod…
▽ More
Style is an integral component of text that expresses a diverse set of information, including interpersonal dynamics (e.g. formality) and the author's emotions or attitudes (e.g. disgust). Humans often employ multiple styles simultaneously. An open question is how large language models can be explicitly controlled so that they weave together target styles when generating text: for example, to produce text that is both negative and non-toxic. Previous work investigates the controlled generation of a single style, or else controlled generation of a style and other attributes. In this paper, we expand this into controlling multiple styles simultaneously. Specifically, we investigate various formulations of multiple style rewards for a reinforcement learning (RL) approach to controlled multi-style generation. These reward formulations include calibrated outputs from discriminators and dynamic weighting by discriminator gradient magnitudes. We find that dynamic weighting generally outperforms static weighting approaches, and we explore its effectiveness in 2- and 3-style control, even compared to strong baselines like plug-and-play model. All code and data for RL pipelines with multiple style attributes will be publicly available.
△ Less
Submitted 21 February, 2024;
originally announced February 2024.
-
Under the Surface: Tracking the Artifactuality of LLM-Generated Data
Authors:
Debarati Das,
Karin De Langis,
Anna Martin-Boyle,
Jaehyung Kim,
Minhwa Lee,
Zae Myung Kim,
Shirley Anugrah Hayati,
Risako Owan,
Bin Hu,
Ritik Parkar,
Ryan Koo,
Jonginn Park,
Aahan Tyagi,
Libby Ferland,
Sanjali Roy,
Vincent Liu,
Dongyeop Kang
Abstract:
This work delves into the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data often intersect in their application, they exert mutual influence on each other and raise significant c…
▽ More
This work delves into the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data often intersect in their application, they exert mutual influence on each other and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, leading to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data like "task labels" to more lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Despite artificial data's capability to match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding of intrinsic human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices in data creation and when using LLMs. It highlights the LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing biases and artifacts produced in LLM-generated content for future research and development. All data and code are available on our project page.
△ Less
Submitted 30 January, 2024; v1 submitted 26 January, 2024;
originally announced January 2024.
-
Benchmarking Cognitive Biases in Large Language Models as Evaluators
Authors:
Ryan Koo,
Minhwa Lee,
Vipul Raheja,
Jong Inn Park,
Zae Myung Kim,
Dongyeop Kang
Abstract:
Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four different size ranges and evaluate their output responses by preference ranking from the other LLMs as evaluators, such as System Star is better than System Square. We then evaluate the quality of ranking outputs intr…
▽ More
Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four different size ranges and evaluate their output responses by preference ranking from the other LLMs as evaluators, such as System Star is better than System Square. We then evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLEr), a benchmark to measure six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias where a model prefers to rank its own outputs highly in evaluation. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark (average of 40% of comparisons across all models) within each of their evaluations that question their robustness as evaluators. Furthermore, we examine the correlation between human and machine preferences and calculate the average Rank-Biased Overlap (RBO) score to be 49.6%, indicating that machine preferences are misaligned with humans. According to our findings, LLMs may still be unable to be utilized for automatic annotation aligned with human preferences. Our project page is at: https://minnesotanlp.github.io/cobbler.
△ Less
Submitted 29 September, 2023;
originally announced September 2023.
-
CoEdIT: Text Editing by Task-Specific Instruction Tuning
Authors:
Vipul Raheja,
Dhruv Kumar,
Ryan Koo,
Dongyeop Kang
Abstract:
We introduce CoEdIT, a state-of-the-art text editing system for writing assistance. CoEdIT takes instructions from the user specifying the attributes of the desired text, such as "Make the sentence simpler" or "Write it in a more neutral style," and outputs the edited text. We present a large language model fine-tuned on a diverse collection of task-specific instructions for text editing (a total…
▽ More
We introduce CoEdIT, a state-of-the-art text editing system for writing assistance. CoEdIT takes instructions from the user specifying the attributes of the desired text, such as "Make the sentence simpler" or "Write it in a more neutral style," and outputs the edited text. We present a large language model fine-tuned on a diverse collection of task-specific instructions for text editing (a total of 82K instructions). Our model (1) achieves state-of-the-art performance on various text editing benchmarks, (2) is competitive with publicly available largest-sized LLMs trained on instructions while being nearly 60x smaller, (3) is capable of generalizing to unseen edit instructions, and (4) exhibits abilities to generalize to composite instructions containing different combinations of edit actions. Through extensive qualitative and quantitative analysis, we show that writers prefer the edits suggested by CoEdIT relative to other state-of-the-art text editing models. Our code, data, and models are publicly available at https://github.com/vipulraheja/coedit.
△ Less
Submitted 23 October, 2023; v1 submitted 16 May, 2023;
originally announced May 2023.
-
Decoding the End-to-end Writing Trajectory in Scholarly Manuscripts
Authors:
Ryan Koo,
Anna Martin,
Linghe Wang,
Dongyeop Kang
Abstract:
Scholarly writing presents a complex space that generally follows a methodical procedure to plan and produce both rationally sound and creative compositions. Recent works involving large language models (LLM) demonstrate considerable success in text generation and revision tasks; however, LLMs still struggle to provide structural and creative feedback on the document level that is crucial to acade…
▽ More
Scholarly writing presents a complex space that generally follows a methodical procedure to plan and produce both rationally sound and creative compositions. Recent works involving large language models (LLM) demonstrate considerable success in text generation and revision tasks; however, LLMs still struggle to provide structural and creative feedback on the document level that is crucial to academic writing. In this paper, we introduce a novel taxonomy that categorizes scholarly writing behaviors according to intention, writer actions, and the information types of the written data. We also provide ManuScript, an original dataset annotated with a simplified version of our taxonomy to show writer actions and the intentions behind them. Motivated by cognitive writing theory, our taxonomy for scientific papers includes three levels of categorization in order to trace the general writing flow and identify the distinct writer activities embedded within each higher-level process. ManuScript intends to provide a complete picture of the scholarly writing process by capturing the linearity and non-linearity of writing trajectory, such that writing assistants can provide stronger feedback and suggestions on an end-to-end level. The collected writing trajectories are viewed at https://minnesotanlp.github.io/REWARD_demo/
△ Less
Submitted 31 March, 2023;
originally announced April 2023.
-
TypeScript's Evolution: An Analysis of Feature Adoption Over Time
Authors:
Joshua D. Scarsbrook,
Mark Utting,
Ryan K. L. Ko
Abstract:
TypeScript is a quickly evolving superset of JavaScript with active development of new features. Our paper seeks to understand how quickly these features are adopted by the developer community. Existing work in JavaScript shows the adoption of dynamic language features can be a major hindrance to static analysis. As TypeScript evolves the addition of features makes the underlying standard more and…
▽ More
TypeScript is a quickly evolving superset of JavaScript with active development of new features. Our paper seeks to understand how quickly these features are adopted by the developer community. Existing work in JavaScript shows the adoption of dynamic language features can be a major hindrance to static analysis. As TypeScript evolves the addition of features makes the underlying standard more and more difficult to keep up with. In our work we present an analysis of 454 open source TypeScript repositories and study the adoption of 13 language features over the past three years. We show that while new versions of the TypeScript compiler are aggressively adopted by the community, the same cannot be said for language features. While some experience strong growth others are rarely adopted by projects. Our work serves as a starting point for future study of the adoption of features in TypeScript. We also release our analysis and data gathering software as open source in the hope it helps the programming languages community.
△ Less
Submitted 17 March, 2023;
originally announced March 2023.
-
Ultraverse: Efficient Retroactive Operation for Attack Recovery in Database Systems and Web Frameworks
Authors:
Ronny Ko,
Chuan Xiao,
Makoto Onizuka,
Yihe Huang,
Zhiqiang Lin
Abstract:
Retroactive operation is an operation that changes a past operation in a series of committed ones (e.g., cancelling the past insertion of '5' into a queue committed at t=3). Retroactive operation has many important security applications such as attack recovery or private data removal (e.g., for GDPR compliance). While prior efforts designed retroactive algorithms for low-level data structures (e.g…
▽ More
Retroactive operation is an operation that changes a past operation in a series of committed ones (e.g., cancelling the past insertion of '5' into a queue committed at t=3). Retroactive operation has many important security applications such as attack recovery or private data removal (e.g., for GDPR compliance). While prior efforts designed retroactive algorithms for low-level data structures (e.g., queue, set), none explored retroactive operation for higher levels, such as database systems or web applications. This is challenging, because: (i) SQL semantics of database systems is complex; (ii) data records can flow through various web application components, such as HTML's DOM trees, server-side user request handlers, and client-side JavaScript code. We propose Ultraverse, the first retroactive operation framework comprised of two components: a database system and a web application framework. The former enables users to retroactively change committed SQL queries; the latter does the same for web applications with preserving correctness of application semantics. Our experimental results show that Ultraverse achieves 10.5x~693.3x speedup on retroactive database update compared to a regular DBMS's flashback and redo.
△ Less
Submitted 2 January, 2023; v1 submitted 9 November, 2022;
originally announced November 2022.
-
Positive-Unlabeled Learning using Random Forests via Recursive Greedy Risk Minimization
Authors:
Jonathan Wilton,
Abigail M. Y. Koay,
Ryan K. L. Ko,
Miao Xu,
Nan Ye
Abstract:
The need to learn from positive and unlabeled data, or PU learning, arises in many applications and has attracted increasing interest. While random forests are known to perform well on many tasks with positive and negative data, recent PU algorithms are generally based on deep neural networks, and the potential of tree-based PU learning is under-explored. In this paper, we propose new random fores…
▽ More
The need to learn from positive and unlabeled data, or PU learning, arises in many applications and has attracted increasing interest. While random forests are known to perform well on many tasks with positive and negative data, recent PU algorithms are generally based on deep neural networks, and the potential of tree-based PU learning is under-explored. In this paper, we propose new random forest algorithms for PU-learning. Key to our approach is a new interpretation of decision tree algorithms for positive and negative data as \emph{recursive greedy risk minimization algorithms}. We extend this perspective to the PU setting to develop new decision tree learning algorithms that directly minimizes PU-data based estimators for the expected risk. This allows us to develop an efficient PU random forest algorithm, PU extra trees. Our approach features three desirable properties: it is robust to the choice of the loss function in the sense that various loss functions lead to the same decision trees; it requires little hyperparameter tuning as compared to neural network based PU learning; it supports a feature importance that directly measures a feature's contribution to risk minimization. Our algorithms demonstrate strong performance on several datasets. Our code is available at \url{https://github.com/puetpaper/PUExtraTrees}.
△ Less
Submitted 16 October, 2022;
originally announced October 2022.
-
FDGATII : Fast Dynamic Graph Attention with Initial Residual and Identity Map**
Authors:
Gayan K. Kulatilleke,
Marius Portmann,
Ryan Ko,
Shekhar S. Chandra
Abstract:
While Graph Neural Networks have gained popularity in multiple domains, graph-structured input remains a major challenge due to (a) over-smoothing, (b) noisy neighbours (heterophily), and (c) the suspended animation problem. To address all these problems simultaneously, we propose a novel graph neural network FDGATII, inspired by attention mechanism's ability to focus on selective information supp…
▽ More
While Graph Neural Networks have gained popularity in multiple domains, graph-structured input remains a major challenge due to (a) over-smoothing, (b) noisy neighbours (heterophily), and (c) the suspended animation problem. To address all these problems simultaneously, we propose a novel graph neural network FDGATII, inspired by attention mechanism's ability to focus on selective information supplemented with two feature preserving mechanisms. FDGATII combines Initial Residuals and Identity Map** with the more expressive dynamic self-attention to handle noise prevalent from the neighbourhoods in heterophilic data sets. By using sparse dynamic attention, FDGATII is inherently parallelizable in design, whist efficient in operation; thus theoretically able to scale to arbitrary graphs with ease. Our approach has been extensively evaluated on 7 datasets. We show that FDGATII outperforms GAT and GCN based benchmarks in accuracy and performance on fully supervised tasks, obtaining state-of-the-art results on Chameleon and Cornell datasets with zero domain-specific graph pre-processing, and demonstrate its versatility and fairness.
△ Less
Submitted 25 October, 2021; v1 submitted 21 October, 2021;
originally announced October 2021.
-
Confined Gradient Descent: Privacy-preserving Optimization for Federated Learning
Authors:
Yanjun Zhang,
Guangdong Bai,
Xue Li,
Surya Nepal,
Ryan K L Ko
Abstract:
Federated learning enables multiple participants to collaboratively train a model without aggregating the training data. Although the training data are kept within each participant and the local gradients can be securely synthesized, recent studies have shown that such privacy protection is insufficient. The global model parameters that have to be shared for optimization are susceptible to leak in…
▽ More
Federated learning enables multiple participants to collaboratively train a model without aggregating the training data. Although the training data are kept within each participant and the local gradients can be securely synthesized, recent studies have shown that such privacy protection is insufficient. The global model parameters that have to be shared for optimization are susceptible to leak information about training data. In this work, we propose Confined Gradient Descent (CGD) that enhances privacy of federated learning by eliminating the sharing of global model parameters. CGD exploits the fact that a gradient descent optimization can start with a set of discrete points and converges to another set at the neighborhood of the global minimum of the objective function. It lets the participants independently train on their local data, and securely share the sum of local gradients to benefit each other. We formally demonstrate CGD's privacy enhancement over traditional FL. We prove that less information is exposed in CGD compared to that of traditional FL. CGD also guarantees desired model accuracy. We theoretically establish a convergence rate for CGD. We prove that the loss of the proprietary models learned for each participant against a model learned by aggregated training data is bounded. Extensive experimental results on two real-world datasets demonstrate the performance of CGD is comparable with the centralized learning, with marginal differences on validation loss (mostly within 0.05) and accuracy (mostly within 1%).
△ Less
Submitted 27 April, 2021;
originally announced April 2021.
-
ColdPress: An Extensible Malware Analysis Platform for Threat Intelligence
Authors:
Haoxi Tan,
Mahin Chandramohan,
Cristina Cifuentes,
Guangdong Bai,
Ryan K. L. Ko
Abstract:
Malware analysis is still largely a manual task. This slow and inefficient approach does not scale to the exponential rise in the rate of new unique malware generated. Hence, automating the process as much as possible becomes desirable.
In this paper, we present ColdPress - an extensible malware analysis platform that automates the end-to-end process of malware threat intelligence gathering inte…
▽ More
Malware analysis is still largely a manual task. This slow and inefficient approach does not scale to the exponential rise in the rate of new unique malware generated. Hence, automating the process as much as possible becomes desirable.
In this paper, we present ColdPress - an extensible malware analysis platform that automates the end-to-end process of malware threat intelligence gathering integrated output modules to perform report generation of arbitrary file formats. ColdPress combines state-of-the-art tools and concepts into a modular system that aids the analyst to efficiently and effectively extract information from malware samples. It is designed as a user-friendly and extensible platform that can be easily extended with user-defined modules.
We evaluated ColdPress with complex real-world malware samples (e.g., WannaCry), demonstrating its efficiency, performance and usefulness to security analysts.
△ Less
Submitted 11 March, 2021;
originally announced March 2021.
-
An Analytics Framework for Heuristic Inference Attacks against Industrial Control Systems
Authors:
Taejun Choi,
Guangdong Bai,
Ryan K L Ko,
Naipeng Dong,
Wenlu Zhang,
Shunyao Wang
Abstract:
Industrial control systems (ICS) of critical infrastructure are increasingly connected to the Internet for remote site management at scale. However, cyber attacks against ICS - especially at the communication channels between humanmachine interface (HMIs) and programmable logic controllers (PLCs) - are increasing at a rate which outstrips the rate of mitigation.
In this paper, we introduce a ven…
▽ More
Industrial control systems (ICS) of critical infrastructure are increasingly connected to the Internet for remote site management at scale. However, cyber attacks against ICS - especially at the communication channels between humanmachine interface (HMIs) and programmable logic controllers (PLCs) - are increasing at a rate which outstrips the rate of mitigation.
In this paper, we introduce a vendor-agnostic analytics framework which allows security researchers to analyse attacks against ICS systems, even if the researchers have zero control automation domain knowledge or are faced with a myriad of heterogenous ICS systems. Unlike existing works that require expertise in domain knowledge and specialised tool usage, our analytics framework does not require prior knowledge about ICS communication protocols, PLCs, and expertise of any network penetration testing tool. Using `digital twin' scenarios comprising industry-representative HMIs, PLCs and firewalls in our test lab, our framework's steps were demonstrated to successfully implement a stealthy deception attack based on false data injection attacks (FDIA). Furthermore, our framework also demonstrated the relative ease of attack dataset collection, and the ability to leverage well-known penetration testing tools.
We also introduce the concept of `heuristic inference attacks', a new family of attack types on ICS which is agnostic to PLC and HMI brands/models commonly deployed in ICS. Our experiments were also validated on a separate ICS dataset collected from a cyber-physical scenario of water utilities. Finally, we utilized time complexity theory to estimate the difficulty for the attacker to conduct the proposed packet analyses, and recommended countermeasures based on our findings.
△ Less
Submitted 28 January, 2021;
originally announced January 2021.
-
Cyber Autonomy: Automating the Hacker- Self-healing, self-adaptive, automatic cyber defense systems and their impact to the industry, society and national security
Authors:
Ryan K L Ko
Abstract:
This paper sets the context for the urgency for cyber autonomy, and the current gaps of the cyber security industry. A novel framework proposing four phases of maturity for full cyber autonomy will be discussed. The paper also reviews new and emerging cyber security automation techniques and tools, and discusses their impact on society, the perceived cyber security skills gap/shortage and national…
▽ More
This paper sets the context for the urgency for cyber autonomy, and the current gaps of the cyber security industry. A novel framework proposing four phases of maturity for full cyber autonomy will be discussed. The paper also reviews new and emerging cyber security automation techniques and tools, and discusses their impact on society, the perceived cyber security skills gap/shortage and national security. We will also be discussing the delicate balance between national security, human rights and ethics, and the potential demise of the manual penetration testing industry in the face of automation.
△ Less
Submitted 8 December, 2020;
originally announced December 2020.
-
Pandora: A Cyber Range Environment for the Safe Testing and Deployment of Autonomous Cyber Attack Tools
Authors:
Hetong Jiang,
Taejun Choi,
Ryan K. L. Ko
Abstract:
Cybersecurity tools are increasingly automated with artificial intelligent (AI) capabilities to match the exponential scale of attacks, compensate for the relatively slower rate of training new cybersecurity talents, and improve of the accuracy and performance of both tools and users. However, the safe and appropriate usage of autonomous cyber attack tools - especially at the development stages fo…
▽ More
Cybersecurity tools are increasingly automated with artificial intelligent (AI) capabilities to match the exponential scale of attacks, compensate for the relatively slower rate of training new cybersecurity talents, and improve of the accuracy and performance of both tools and users. However, the safe and appropriate usage of autonomous cyber attack tools - especially at the development stages for these tools - is still largely an unaddressed gap. Our survey of current literature and tools showed that most of the existing cyber range designs are mostly using manual tools and have not considered augmenting automated tools or the potential security issues caused by the tools. In other words, there is still room for a novel cyber range design which allow security researchers to safely deploy autonomous tools and perform automated tool testing if needed. In this paper, we introduce Pandora, a safe testing environment which allows security researchers and cyber range users to perform experiments on automated cyber attack tools that may have strong potential of usage and at the same time, a strong potential for risks. Unlike existing testbeds and cyber ranges which have direct compatibility with enterprise computer systems and the potential for risk propagation across the enterprise network, our test system is intentionally designed to be incompatible with enterprise real-world computing systems to reduce the risk of attack propagation into actual infrastructure. Our design also provides a tool to convert in-development automated cyber attack tools into to executable test binaries for validation and usage realistic enterprise system environments if required. Our experiments tested automated attack tools on our proposed system to validate the usability of our proposed environment. Our experiments also proved the safety of our environment by compatibility testing using simple malicious code.
△ Less
Submitted 24 September, 2020;
originally announced September 2020.
-
PrivColl: Practical Privacy-Preserving Collaborative Machine Learning
Authors:
Yanjun Zhang,
Guangdong Bai,
Xue Li,
Caitlin Curtis,
Chen Chen,
Ryan K L Ko
Abstract:
Collaborative learning enables two or more participants, each with their own training dataset, to collaboratively learn a joint model. It is desirable that the collaboration should not cause the disclosure of either the raw datasets of each individual owner or the local model parameters trained on them. This privacy-preservation requirement has been approached through differential privacy mechanis…
▽ More
Collaborative learning enables two or more participants, each with their own training dataset, to collaboratively learn a joint model. It is desirable that the collaboration should not cause the disclosure of either the raw datasets of each individual owner or the local model parameters trained on them. This privacy-preservation requirement has been approached through differential privacy mechanisms, homomorphic encryption (HE) and secure multiparty computation (MPC), but existing attempts may either introduce the loss of model accuracy or imply significant computational and/or communicational overhead. In this work, we address this problem with the lightweight additive secret sharing technique. We propose PrivColl, a framework for protecting local data and local models while ensuring the correctness of training processes. PrivColl employs secret sharing technique for securely evaluating addition operations in a multiparty computation environment, and achieves practicability by employing only the homomorphic addition operations. We formally prove that it guarantees privacy preservation even though the majority (n-2 out of n) of participants are corrupted. With experiments on real-world datasets, we further demonstrate that PrivColl retains high efficiency. It achieves a speedup of more than 45X over the state-of-the-art MPC/HE based schemes for training linear/logistic regression, and 216X faster for training neural network.
△ Less
Submitted 14 July, 2020;
originally announced July 2020.
-
Local Editing in LZ-End Compressed Data
Authors:
Daniel Roodt,
Ulrich Speidel,
Vimal Kumar,
Ryan K. L. Ko
Abstract:
This paper presents an algorithm for the modification of data compressed using LZ-End, a derivate of LZ77, without prior decompression. The performance of the algorithm and the impact of the modifications on the compression ratio is evaluated. Finally, we discuss the importance of this work as a first step towards local editing in Lempel-Ziv compressed data.
This paper presents an algorithm for the modification of data compressed using LZ-End, a derivate of LZ77, without prior decompression. The performance of the algorithm and the impact of the modifications on the compression ratio is evaluated. Finally, we discuss the importance of this work as a first step towards local editing in Lempel-Ziv compressed data.
△ Less
Submitted 12 July, 2020;
originally announced July 2020.
-
A Classifier-free Ensemble Selection Method based on Data Diversity in Random Subspaces
Authors:
Albert H. R. Ko,
Robert Sabourin,
Alceu S. Britto Jr,
Luiz E. S. Oliveira
Abstract:
The Ensemble of Classifiers (EoC) has been shown to be effective in improving the performance of single classifiers by combining their outputs, and one of the most important properties involved in the selection of the best EoC from a pool of classifiers is considered to be classifier diversity. In general, classifier diversity does not occur randomly, but is generated systematically by various ens…
▽ More
The Ensemble of Classifiers (EoC) has been shown to be effective in improving the performance of single classifiers by combining their outputs, and one of the most important properties involved in the selection of the best EoC from a pool of classifiers is considered to be classifier diversity. In general, classifier diversity does not occur randomly, but is generated systematically by various ensemble creation methods. By using diverse data subsets to train classifiers, these methods can create diverse classifiers for the EoC. In this work, we propose a scheme to measure data diversity directly from random subspaces, and explore the possibility of using it to select the best data subsets for the construction of the EoC. Our scheme is the first ensemble selection method to be presented in the literature based on the concept of data diversity. Its main advantage over the traditional framework (ensemble creation then selection) is that it obviates the need for classifier training prior to ensemble selection. A single Genetic Algorithm (GA) and a Multi-Objective Genetic Algorithm (MOGA) were evaluated to search for the best solutions for the classifier-free ensemble selection. In both cases, objective functions based on different clustering diversity measures were implemented and tested. All the results obtained with the proposed classifier-free ensemble selection method were compared with the traditional classifier-based ensemble selection using Mean Classifier Error (ME) and Majority Voting Error (MVE). The applicability of the method is tested on UCI machine learning problems and NIST SD19 handwritten numerals.
△ Less
Submitted 12 August, 2014;
originally announced August 2014.
-
From Linked Data to Relevant Data -- Time is the Essence
Authors:
Markus Kirchberg,
Ryan K L Ko,
Bu Sung Lee
Abstract:
The Semantic Web initiative puts emphasis not primarily on putting data on the Web, but rather on creating links in a way that both humans and machines can explore the Web of data. When such users access the Web, they leave a trail as Web servers maintain a history of requests. Web usage mining approaches have been studied since the beginning of the Web given the log's huge potential for purposes…
▽ More
The Semantic Web initiative puts emphasis not primarily on putting data on the Web, but rather on creating links in a way that both humans and machines can explore the Web of data. When such users access the Web, they leave a trail as Web servers maintain a history of requests. Web usage mining approaches have been studied since the beginning of the Web given the log's huge potential for purposes such as resource annotation, personalization, forecasting etc. However, the impact of any such efforts has not really gone beyond generating statistics detailing who, when, and how Web pages maintained by a Web server were visited.
△ Less
Submitted 25 March, 2011;
originally announced March 2011.