-
REPOEXEC: Evaluate Code Generation with a Repository-Level Executable Benchmark
Authors:
Nam Le Hai,
Dung Manh Nguyen,
Nghi D. Q. Bui
Abstract:
The ability of CodeLLMs to generate executable and functionally correct code at the repository-level scale remains largely unexplored. We introduce RepoExec, a novel benchmark for evaluating code generation at the repository-level scale. RepoExec focuses on three main aspects: executability, functional correctness through automated test case generation with high coverage rate, and carefully crafted cross-file contexts to accurately generate code. Our work explores a controlled scenario where developers specify necessary code dependencies, challenging the model to integrate these accurately. Experiments show that while pretrained LLMs outperform instruction-tuned models in correctness, the latter excel in utilizing provided dependencies and demonstrating debugging capabilities. We also introduce a new instruction-tuned dataset that focuses on code dependencies and demonstrate that CodeLLMs fine-tuned on our dataset have a better capability to leverage these dependencies effectively. RepoExec aims to provide a comprehensive evaluation of code functionality and alignment with developer intent, paving the way for more reliable and applicable CodeLLMs in real-world scenarios. The dataset and source code can be found at https://github.com/FSoft-AI4Code/RepoExec.
Submitted 19 June, 2024; v1 submitted 17 June, 2024;
originally announced June 2024.
-
AgileCoder: Dynamic Collaborative Agents for Software Development based on Agile Methodology
Authors:
Minh Huynh Nguyen,
Thang Phan Chau,
Phong X. Nguyen,
Nghi D. Q. Bui
Abstract:
Software agents have emerged as promising tools for addressing complex software engineering tasks. However, existing works oversimplify software development workflows by following the waterfall model. Thus, we propose AgileCoder, a multi-agent system that integrates Agile Methodology (AM) into the framework. This system assigns specific AM roles such as Product Manager, Developer, and Tester to different agents, who then collaboratively develop software based on user inputs. AgileCoder enhances development efficiency by organizing work into sprints that develop the software incrementally. Additionally, we introduce Dynamic Code Graph Generator, a module that creates a Code Dependency Graph dynamically as updates are made to the codebase. This allows agents to better comprehend the codebase, leading to more precise code generation and modifications throughout the software development process. AgileCoder surpasses existing benchmarks, like ChatDev and MetaGPT, establishing a new standard and showcasing the capabilities of multi-agent systems in advanced software engineering environments. Our source code can be found at https://github.com/FSoft-AI4Code/AgileCoder.
Submitted 16 June, 2024;
originally announced June 2024.
-
Improved All-Pairs Approximate Shortest Paths in Congested Clique
Authors:
Hong Duc Bui,
Shashwat Chandra,
Yi-Jun Chang,
Michal Dory,
Dean Leitersdorf
Abstract:
In this paper, we present new algorithms for approximating All-Pairs Shortest Paths (APSP) in the Congested Clique model. We present randomized algorithms for weighted undirected graphs.
Our first contribution is an $O(1)$-approximate APSP algorithm taking just $O(\log \log \log n)$ rounds. Prior to our work, the fastest algorithms that give an $O(1)$-approximation for APSP take $\operatorname{poly}(\log{n})$ rounds in weighted undirected graphs, and $\operatorname{poly}(\log \log n)$ rounds in unweighted undirected graphs.
If we terminate the execution of the algorithm early, we obtain an $O(t)$-round algorithm that yields an $O \big( (\log n)^{1/2^t} \big) $ distance approximation for a parameter $t$. The trade-off between $t$ and the approximation quality provides flexibility for different scenarios, allowing the algorithm to adapt to specific requirements. In particular, we can get an $O \big( (\log n)^{1/2^t} \big) $-approximation for any constant $t$ in $O(1)$ rounds. Such a result was previously known only for the special case $t=0$.
A key ingredient in our algorithm is a lemma that allows us to improve an $O(a)$-approximation for APSP to an $O(\sqrt{a})$-approximation for APSP in $O(1)$ rounds. To prove the lemma, we develop several new tools, including $O(1)$-round algorithms for computing the $k$ closest nodes, a certain type of hopset, and skeleton graphs.
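The way the lemma produces both bounds can be written out as a short recurrence. This is only a sketch: it assumes the iteration starts from an $O(\log n)$-approximation (the $t=0$ case), ignores the constants hidden in the $O(\cdot)$ notation, and uses the symbol $a_i$, which is introduced here for illustration, for the approximation factor after $i$ applications of the lemma (logarithms are base 2).
$$a_0 = \log n, \qquad a_{i+1} = \sqrt{a_i} \quad\Longrightarrow\quad a_t = (\log n)^{1/2^t}.$$
$$a_t \le 2 \iff \frac{\log\log n}{2^t} \le 1 \iff t \ge \log\log\log n.$$
Under these assumptions, $t$ applications cost $O(t)$ rounds, and roughly $\log\log\log n$ applications bring the factor down to a constant, which matches the $O(\log \log \log n)$ round bound stated for the $O(1)$-approximation.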
Submitted 4 May, 2024;
originally announced May 2024.
-
The Trade-off between Performance, Efficiency, and Fairness in Adapter Modules for Text Classification
Authors:
Minh Duc Bui,
Katharina von der Wense
Abstract:
Current natural language processing (NLP) research tends to focus on only one or, less frequently, two dimensions - e.g., performance, privacy, fairness, or efficiency - at a time, which may lead to suboptimal conclusions and often overlooks the broader goal of achieving trustworthy NLP. Work on adapter modules (Houlsby et al., 2019; Hu et al., 2021) focuses on improving performance and efficiency, with no investigation of unintended consequences on other aspects such as fairness. To address this gap, we conduct experiments on three text classification datasets by either (1) finetuning all parameters or (2) using adapter modules. Regarding performance and efficiency, we confirm prior findings that the accuracy of adapter-enhanced models is roughly on par with that of fully finetuned models, while training time is substantially reduced. Regarding fairness, we show that adapter modules result in mixed fairness across sensitive groups. Further investigation reveals that, when the standard fine-tuned model exhibits limited biases, adapter modules typically do not introduce extra bias. On the other hand, when the finetuned model exhibits increased bias, the impact of adapter modules on bias becomes more unpredictable, introducing the risk of significantly magnifying these biases for certain groups. Our findings highlight the need for a case-by-case evaluation rather than a one-size-fits-all judgment.
Submitted 3 May, 2024;
originally announced May 2024.
-
Knowledge Distillation vs. Pretraining from Scratch under a Fixed (Computation) Budget
Authors:
Minh Duc Bui,
Fabian David Schmidt,
Goran Glavaš,
Katharina von der Wense
Abstract:
Compared to standard language model (LM) pretraining (i.e., from scratch), Knowledge Distillation (KD) entails an additional forward pass through a teacher model that is typically substantially larger than the target student model. As such, KD in LM pretraining materially slows down throughput of pretraining instances vis-a-vis pretraining from scratch. Scaling laws of LM pretraining suggest that smaller models can close the gap to larger counterparts if trained on more data (i.e., processing more tokens), and under a fixed computation budget, smaller models are able to process more data than larger models. We thus hypothesize that KD might, in fact, be suboptimal to pretraining from scratch for obtaining smaller LMs, when appropriately accounting for the compute budget. To test this, we compare pretraining from scratch against several KD strategies for masked language modeling (MLM) in a fair experimental setup, with respect to amount of computation as well as pretraining data. Downstream results on GLUE, however, do not confirm our hypothesis: while pretraining from scratch performs comparably to ordinary KD under a fixed computation budget, more sophisticated KD strategies, namely TinyBERT (Jiao et al., 2020) and MiniLM (Wang et al., 2023), outperform it by a notable margin. We further find that KD yields larger gains over pretraining from scratch when the data must be repeated under the fixed computation budget.
Submitted 30 April, 2024;
originally announced April 2024.
-
Radial Basis Function Neural Networks for Formation Control of Unmanned Aerial Vehicles
Authors:
Duy-Nam Bui,
Manh Duong Phung
Abstract:
This paper addresses the problem of controlling multiple unmanned aerial vehicles (UAVs) cooperating in a formation to carry out a complex task such as surface inspection. We first use the virtual leader-follower model to determine the topology and trajectory of the formation. A double-loop control system combining backstepping and sliding mode control techniques is then designed for the UAVs to track the trajectory. A radial basis function neural network (RBFNN) capable of estimating external disturbances is developed to enhance the robustness of the controller. The stability of the controller is proven by using the Lyapunov theorem. A number of comparisons and software-in-the-loop (SIL) tests have been conducted to evaluate the performance of the proposed controller. The results show that our controller not only outperforms other state-of-the-art controllers but is also sufficient for complex tasks of UAVs such as collecting surface data for inspection. The source code of our controller can be found at https://github.com/duynamrcv/rbf_bsmc
Submitted 21 April, 2024;
originally announced April 2024.
-
Envisioning the Next-Generation AI Coding Assistants: Insights & Proposals
Authors:
Khanh Nghiem,
Anh Minh Nguyen,
Nghi D. Q. Bui
Abstract:
As a research-product hybrid group in AI for Software Engineering (AI4SE), we present four key takeaways from our experience developing in-IDE AI coding assistants. AI coding assistants should set clear expectations for usage, integrate with advanced IDE capabilities and existing extensions, use extendable backend designs, and collect app data responsibly for downstream analyses. We propose open questions and challenges that academia and industry should address to realize the vision of next-generation AI coding assistants.
Submitted 21 March, 2024;
originally announced March 2024.
-
CLEAR: Cross-Transformers with Pre-trained Language Model is All you need for Person Attribute Recognition and Retrieval
Authors:
Doanh C. Bui,
Thinh V. Le,
Ba Hung Ngo,
Tae Jong Choi
Abstract:
Person attribute recognition and attribute-based retrieval are two core human-centric tasks. In the recognition task, the challenge is specifying attributes depending on a person's appearance, while the retrieval task involves searching for matching persons based on attribute queries. There is a significant relationship between recognition and retrieval tasks. In this study, we demonstrate that if there is a sufficiently robust network to solve person attribute recognition, it can be adapted to facilitate better performance for the retrieval task. Another issue that needs addressing in the retrieval task is the modality gap between attribute queries and persons' images. Therefore, in this paper, we present CLEAR, a unified network designed to address both tasks. We introduce a robust cross-transformers network to handle person attribute recognition. Additionally, leveraging a pre-trained language model, we construct pseudo-descriptions for attribute queries and introduce an effective training strategy to train only a few additional parameters for adapters, facilitating the handling of the retrieval task. Finally, the unified CLEAR model is evaluated on five benchmarks: PETA, PA100K, Market-1501, RAPv2, and UPAR-2024. Without bells and whistles, CLEAR achieves state-of-the-art performance or competitive results for both tasks, significantly outperforming other competitors in terms of person retrieval performance on the widely-used Market-1501 dataset.
Submitted 30 April, 2024; v1 submitted 10 March, 2024;
originally announced March 2024.
-
RepoHyper: Better Context Retrieval Is All You Need for Repository-Level Code Completion
Authors:
Huy N. Phan,
Hoang N. Phan,
Tien N. Nguyen,
Nghi D. Q. Bui
Abstract:
Code Large Language Models (CodeLLMs) have demonstrated impressive proficiency in code completion tasks. However, they often fall short of fully understanding the extensive context of a project repository, such as the intricacies of relevant files and class hierarchies, which can result in less precise completions. To overcome these limitations, we present RepoHyper, a multifaceted framework designed to address the complex challenges associated with repository-level code completion. Central to RepoHyper is the Repo-level Semantic Graph (RSG), a novel semantic graph structure that encapsulates the vast context of code repositories. Furthermore, RepoHyper leverages the Expand and Refine retrieval method, including a graph expansion and a link prediction algorithm applied to the RSG, enabling the effective retrieval and prioritization of relevant code snippets. Our evaluations show that RepoHyper markedly outperforms existing techniques in repository-level code completion, showcasing enhanced accuracy across various datasets when compared to several strong baselines. Our implementation of RepoHyper can be found at https://github.com/FSoft-AI4Code/RepoHyper.
Submitted 16 March, 2024; v1 submitted 10 March, 2024;
originally announced March 2024.
-
Physics-based material parameters extraction from perovskite experiments via Bayesian optimization
Authors:
Hualin Zhan,
Viqar Ahmad,
Azul Mayon,
Grace Tabi,
Anh Dinh Bui,
Zhuofeng Li,
Daniel Walter,
Hieu Nguyen,
Klaus Weber,
Thomas White,
Kylie Catchpole
Abstract:
The ability to extract material parameters of perovskite from quantitative experimental analysis is essential for rational design of photovoltaic and optoelectronic applications. However, the difficulty of this analysis increases significantly with the complexity of the theoretical model and the number of material parameters for perovskite. Here we use Bayesian optimization to develop an analysis platform that can extract up to 8 fundamental material parameters of an organometallic perovskite semiconductor from a transient photoluminescence experiment, based on a complex full physics model that includes drift-diffusion of carriers and dynamic defect occupation. An example study of thermal degradation reveals that the carrier mobility and trap-assisted recombination coefficient are reduced noticeably, while the defect energy level remains nearly unchanged. The reduced carrier mobility can dominate the overall effect on thermal degradation of perovskite solar cells by reducing the fill factor, despite the opposite effect of the reduced trap-assisted recombination coefficient on increasing the fill factor. In future, this platform can be conveniently applied to other experiments or to combinations of experiments, accelerating materials discovery and optimization of semiconductor materials for photovoltaics and other applications.
Submitted 29 May, 2024; v1 submitted 16 February, 2024;
originally announced February 2024.
-
Ant Colony Optimization for Cooperative Inspection Path Planning Using Multiple Unmanned Aerial Vehicles
Authors:
Duy Nam Bui,
Thuy Ngan Duong,
Manh Duong Phung
Abstract:
This paper presents a new swarm intelligence-based approach to deal with the cooperative path planning problem of unmanned aerial vehicles (UAVs), which is essential for the automatic inspection of infrastructure. The approach uses a 3D model of the structure to generate viewpoints for the UAVs. The calculation of the viewpoints considers the constraints related to the UAV formation model, camera parameters, and requirements for data post-processing. The viewpoints are then used as input to formulate the path planning as an extended traveling salesman problem with a new cost function. Ant colony optimization is finally used to solve the problem to yield optimal inspection paths. Experiments with 3D models of real structures have been conducted to evaluate the performance of the proposed approach. The results show that our system not only generates feasible inspection paths for UAVs but also reduces the path length by 29.47% for complex structures when compared with another heuristic approach. The source code of the algorithm can be found at https://github.com/duynamrcv/aco_3d_ipp.
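For readers unfamiliar with ant colony optimization, the following is a minimal sketch of the classic Ant System applied to an ordinary Euclidean traveling salesman problem. It is not the paper's formulation: the extended problem and the new cost function are not described in the abstract, so plain Euclidean distance stands in for them, and the function names and all parameter values (`n_ants`, `alpha`, `beta`, `rho`, `q`, `tau_min`) are illustrative assumptions.

```python
import math
import random


def tour_length(tour, dist):
    """Length of the closed tour that visits the nodes in the given order."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def ant_colony_tsp(points, n_ants=20, n_iters=100, alpha=1.0, beta=3.0,
                   rho=0.5, q=1.0, tau_min=1e-6, seed=0):
    """Basic Ant System for a closed tour over distinct `points`.

    All parameter values are illustrative defaults, not tuned settings.
    """
    rng = random.Random(seed)
    n = len(points)
    dist = [[math.dist(p, r) for r in points] for p in points]
    pheromone = [[1.0] * n for _ in range(n)]
    best_tour, best_len = None, float("inf")

    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            start = rng.randrange(n)
            tour, unvisited = [start], set(range(n)) - {start}
            while unvisited:
                cur = tour[-1]
                cand = sorted(unvisited)
                # Edge attractiveness: pheromone^alpha * (1 / distance)^beta.
                weights = [pheromone[cur][j] ** alpha * (1.0 / dist[cur][j]) ** beta
                           for j in cand]
                nxt = rng.choices(cand, weights=weights)[0]
                tour.append(nxt)
                unvisited.remove(nxt)
            tours.append((tour, tour_length(tour, dist)))

        # Evaporate pheromone on every edge, keeping a small positive floor.
        for i in range(n):
            for j in range(n):
                pheromone[i][j] = max(pheromone[i][j] * (1.0 - rho), tau_min)

        # Each ant deposits pheromone inversely proportional to its tour length.
        for tour, length in tours:
            for i in range(n):
                a, b = tour[i], tour[(i + 1) % n]
                pheromone[a][b] += q / length
                pheromone[b][a] += q / length
            if length < best_len:
                best_tour, best_len = tour, length

    return best_tour, best_len
```

Calling `ant_colony_tsp` with a list of coordinate tuples returns the best visiting order found and its length. In an inspection setting the points would be the generated viewpoints, and the distance matrix would be replaced by whatever cost the planner defines.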
Submitted 13 February, 2024;
originally announced February 2024.
-
Self-Reconfigurable V-shape Formation of Multiple UAVs in Narrow Space Environments
Authors:
Duy Nam Bui,
Manh Duong Phung,
Hung Pham Duy
Abstract:
This paper presents the design and implementation of a self-reconfigurable V-shape formation controller for multiple unmanned aerial vehicles (UAVs) navigating through narrow spaces in a dense obstacle environment. The selection of the V-shape formation is motivated by its maneuverability and visibility advantages. The main objective is to develop an effective formation control strategy that allows UAVs to autonomously adjust their positions to form the desired formation while navigating through obstacles. To achieve this, we propose a distributed behavior-based control algorithm that combines the behaviors designed for individual UAVs so that they together navigate the UAVs to their desired positions. The reconfiguration process is automatic, utilizing individual UAV sensing within the formation, allowing for dynamic adaptations such as opening/closing wings or merging into a straight line. Simulation results show that the self-reconfigurable V-shape formation offers adaptability and effectiveness for UAV formations in complex operational scenarios.
Submitted 13 February, 2024;
originally announced February 2024.
-
Mechanical Attributes of Fractal Dragons
Authors:
Huy T. Q. Phan,
Duc M. Bui,
Cong T. Than,
Trung V. Phan
Abstract:
Fractals are ubiquitous natural emergences that have gained increased attention in engineering applications, thanks to recent technological advancements enabling the fabrication of structures spanning across many spatial scales. We show how the geometries of fractals can be exploited to determine their important mechanical properties, such as the first and second moments, which physically correspond to the center of mass and the moment of inertia, using a family of complex fractals known as the dragons.
Submitted 12 November, 2023;
originally announced November 2023.
-
Functional Overlap Reranking for Neural Code Generation
Authors:
Hung Quoc To,
Minh Huynh Nguyen,
Nghi D. Q. Bui
Abstract:
Code Large Language Models (CodeLLMs) have ushered in a new era in code generation advancements. However, selecting the best code solutions from all possible CodeLLM outputs remains a challenge. Previous methods often overlooked the intricate functional similarities and interactions between solution clusters. We introduce SRank, a novel reranking strategy for selecting the best solutions from code generation, focusing on modeling the relationships between clusters of solutions. By quantifying the functional overlap between solution clusters, our approach provides a better ranking strategy for code solutions. Empirical results show that our method achieves remarkable results on the pass@1 score. For instance, on the Human-Eval benchmark, we achieve 69.66% in pass@1 with Codex002, 75.31% with WizardCoder, 53.99% with StarCoder, and 60.55% with CodeGen, surpassing state-of-the-art code generation reranking methods such as CodeT and Coder-Reviewer on the same CodeLLM by a significant margin (approximately 6.1% improvement on average). Even in scenarios with a limited number of sampled solutions and test cases, our approach demonstrates robustness and superiority, marking a new benchmark in code generation reranking. Our implementation can be found at https://github.com/FSoft-AI4Code/SRank-CodeRanker.
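The general idea of reranking by functional overlap between clusters can be illustrated with a small sketch. It should not be read as SRank's actual scoring formula, which the abstract does not specify: the function name `rerank_by_functional_overlap` and the scoring rule (overlap weighted by cluster size) are assumptions made for illustration, and exceptions are simply treated as ordinary outputs.

```python
from collections import defaultdict


def rerank_by_functional_overlap(solutions, test_inputs):
    """Order candidate functions by a cluster-level score (hypothetical rule)."""
    # 1. Run every candidate on every test input and record what it returns.
    signatures = []
    for fn in solutions:
        outs = []
        for x in test_inputs:
            try:
                outs.append(repr(fn(x)))
            except Exception as exc:
                outs.append("error:" + type(exc).__name__)
        signatures.append(tuple(outs))

    # 2. Cluster candidates that produce identical outputs on all inputs.
    clusters = defaultdict(list)
    for idx, sig in enumerate(signatures):
        clusters[sig].append(idx)

    # 3. Overlap of two clusters: number of inputs on which their outputs agree.
    def overlap(sig_a, sig_b):
        return sum(a == b for a, b in zip(sig_a, sig_b))

    # 4. Score a cluster by its overlap with every cluster (itself included),
    #    weighted by the size of that cluster.
    scores = {
        sig: sum(overlap(sig, other) * len(members)
                 for other, members in clusters.items())
        for sig in clusters
    }

    # 5. Emit candidates cluster by cluster, highest-scoring cluster first.
    ranked = sorted(clusters, key=lambda sig: scores[sig], reverse=True)
    return [solutions[i] for sig in ranked for i in clusters[sig]]
```

With this rule, a large cluster whose behaviour partly agrees with other clusters is ranked above an isolated one, which is the intuition the abstract describes. A real system would execute generated code in a sandbox rather than calling it directly.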
Submitted 22 June, 2024; v1 submitted 16 October, 2023;
originally announced November 2023.
-
Real-Time Magnetic Tracking and Diagnosis of COVID-19 via Machine Learning
Authors:
Dang Nguyen,
Phat K. Huynh,
Vinh Duc An Bui,
Kee Young Hwang,
Nityanand Jain,
Chau Nguyen,
Le Huu Nhat Minh,
Le Van Truong,
Xuan Thanh Nguyen,
Dinh Hoang Nguyen,
Le Tien Dung,
Trung Q. Le,
Manh-Huong Phan
Abstract:
The COVID-19 pandemic underscored the importance of reliable, noninvasive diagnostic tools for robust public health interventions. In this work, we fused magnetic respiratory sensing technology (MRST) with machine learning (ML) to create a diagnostic platform for real-time tracking and diagnosis of COVID-19 and other respiratory diseases. The MRST precisely captures breathing patterns through three specific breath testing protocols: normal breath, holding breath, and deep breath. We collected breath data from both COVID-19 patients and healthy subjects in Vietnam using this platform, which then served to train and validate ML models. Our evaluation encompassed multiple ML algorithms, including support vector machines and deep learning models, assessing their ability to diagnose COVID-19. Our multi-model validation methodology ensures a thorough comparison and grants the adaptability to select the optimal model, striking a balance between diagnostic precision and model interpretability. The findings highlight the exceptional potential of our diagnostic tool in pinpointing respiratory anomalies, achieving over 90% accuracy. This innovative sensor technology can be seamlessly integrated into healthcare settings for patient monitoring, marking a significant enhancement for the healthcare infrastructure.
Submitted 1 November, 2023;
originally announced November 2023.
-
DocChecker: Bootstrapping Code Large Language Model for Detecting and Resolving Code-Comment Inconsistencies
Authors:
Anh T. V. Dau,
** L. C. Guo,
Nghi D. Q. Bui
Abstract:
Comments within source code are essential for developers to comprehend the code's purpose and ensure its correct usage. However, as codebases evolve, maintaining an accurate alignment between the comments and the code becomes increasingly challenging. Recognizing the growing interest in automated solutions for detecting and correcting differences between code and its accompanying comments, current methods rely primarily on heuristic rules. In contrast, this paper presents DocChecker, a tool powered by deep learning. DocChecker is adept at identifying inconsistencies between code and comments, and it can also generate synthetic comments. This capability enables the tool to detect and correct instances where comments do not accurately reflect their corresponding code segments. We demonstrate the effectiveness of DocChecker using the Just-In-Time and CodeXGlue datasets in different settings. Particularly, DocChecker achieves a new State-of-the-art result of 72.3% accuracy on the Inconsistency Code-Comment Detection (ICCD) task and 33.64 BLEU-4 on the code summarization task against other Large Language Models (LLMs), even surpassing GPT 3.5 and CodeLlama.
DocChecker is accessible for use and evaluation. It can be found on our GitHub https://github.com/FSoft-AI4Code/DocChecker and as an Online Tool http://4.193.50.237:5000/. For a more comprehensive understanding of its functionality, a demonstration video is available on YouTube https://youtu.be/FqnPmd531xw.
Submitted 2 February, 2024; v1 submitted 10 June, 2023;
originally announced June 2023.
-
CodeTF: One-stop Transformer Library for State-of-the-art Code LLM
Authors:
Nghi D. Q. Bui,
Hung Le,
Yue Wang,
Junnan Li,
Akhilesh Deepak Gotmare,
Steven C. H. Hoi
Abstract:
Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling these tasks by leveraging massive open-source code data and programming language features. However, the development and deployment of such models often require expertise in both machine learning and software engineering, creating a barrier to model adoption. In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence. Following the principles of modular design and extensible framework, we design CodeTF with a unified interface to enable rapid access and development across different types of models, datasets and tasks. Our library supports a collection of pretrained Code LLM models and popular code benchmarks, including a standardized interface to train and serve code LLMs efficiently, and data features such as language-specific parsers and utility functions for extracting code attributes. In this paper, we describe the design principles, the architecture, key modules and components, and compare with other related library tools. Finally, we hope CodeTF is able to bridge the gap between machine learning/generative AI and software engineering, providing a comprehensive open-source solution for developers, researchers, and practitioners.
Submitted 31 May, 2023;
originally announced June 2023.
-
CodeT5+: Open Code Large Language Models for Code Understanding and Generation
Authors:
Yue Wang,
Hung Le,
Akhilesh Deepak Gotmare,
Nghi D. Q. Bui,
Junnan Li,
Steven C. H. Hoi
Abstract:
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence. However, existing code LLMs have two main limitations in terms of architecture and pretraining tasks. First, they often adopt a specific architecture (encoder-only or decoder-only) or rely on a unified encoder-decoder network for different downstream tasks. The former paradigm is limited by inflexibility in applications while in the latter, the model is treated as a single system for all tasks, leading to suboptimal performance on a subset of tasks. Secondly, they often employ a limited set of pretraining objectives which might not be relevant to some downstream tasks and hence result in substantial performance degradation. To address these limitations, we propose "CodeT5+", a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks. Such flexibility is enabled by our proposed mixture of pretraining objectives to mitigate the pretrain-finetune discrepancy. These objectives cover span denoising, contrastive learning, text-code matching, and causal LM pretraining tasks, on both unimodal and bimodal multilingual code corpora. Furthermore, we propose to initialize CodeT5+ with frozen off-the-shelf LLMs without training from scratch to efficiently scale up our models, and explore instruction-tuning to align with natural language instructions. We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning. We observe state-of-the-art (SoTA) model performance on various code-related tasks, such as code generation and completion, math programming, and text-to-code retrieval tasks. Particularly, our instruction-tuned CodeT5+ 16B achieves new SoTA results on HumanEval code generation task against other open code LLMs.
Submitted 20 May, 2023; v1 submitted 13 May, 2023;
originally announced May 2023.
-
The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Authors:
Dung Nguyen Manh,
Nam Le Hai,
Anh T. V. Dau,
Anh Minh Nguyen,
Khanh Nghiem,
Jin Guo,
Nghi D. Q. Bui
Abstract:
We present The Vault, a dataset of high-quality code-text pairs in multiple programming languages for training large language models to understand and generate code. We present methods for thoroughly extracting samples that use both rule-based and deep learning-based methods to ensure that they contain high-quality pairs of code and text, resulting in a dataset of 43 million high-quality code-text pairs. Our extensive evaluations on common coding tasks including code generation, code search and code summarization show that when fine-tuning Code Large Language Models on The Vault, such models outperform the same models trained on other datasets such as CodeSearchNet. We also provide detailed analyses of our datasets to assess the effects of various programming languages and docstrings on the performance of such models.
Submitted 30 October, 2023; v1 submitted 9 May, 2023;
originally announced May 2023.
-
Development of a Vision System to Enhance the Reliability of the Pick-and-Place Robot for Autonomous Testing of Camera Module used in Smartphones
Authors:
Hoang-Anh Phan,
Duy Nam Bui,
Tuan Nguyen Dinh,
Bao-Anh Hoang,
An Nguyen Ngoc,
Dong Tran Huu Quoc,
Ha Tran Thi Thuy,
Tung Thanh Bui,
Van Nguyen Thi Thanh
Abstract:
Pick-and-place robots are commonly used in modern industrial manufacturing. For complex devices/parts like camera modules used in smartphones, which contain optical parts, electrical components and interfacing connectors, the placement operation may not be absolutely accurate, which may damage the device under test during the mechanical movement needed to make good contact for electrical function inspection. In this paper, we propose an effective vision system, including hardware and algorithm, to enhance the reliability of the pick-and-place robot for autonomous memory testing of camera modules. With limited hardware based on a camera and a Raspberry Pi, and a simplified image processing algorithm based on histogram information, the vision system can confirm the presence of the camera modules in the feeding tray and the placement accuracy of the camera module in the test socket. As a result, the system can work with more flexibility and avoid damaging the device under test. The system was experimentally evaluated by testing approximately 2000 camera modules under stable lighting conditions. Experimental results demonstrate that the system achieves an accuracy of more than 99.92%. With its simplicity and effectiveness, the proposed vision system can be considered a useful solution for pick-and-place systems in industry.
Submitted 8 May, 2023;
originally announced May 2023.
-
UIT-OpenViIC: A Novel Benchmark for Evaluating Image Captioning in Vietnamese
Authors:
Doanh C. Bui,
Nghia Hieu Nguyen,
Khang Nguyen
Abstract:
Image Captioning is one of the vision-language tasks that still interest the research community worldwide in the 2020s. MS-COCO Caption benchmark is commonly used to evaluate the performance of advanced captioning models, although it was published in 2015. Recent captioning models trained on the MS-COCO Caption dataset only have good performance in language patterns of English; they do not have such good performance in contexts captured in Vietnam or fluently caption images using Vietnamese. To contribute to the low-resources research community as in Vietnam, we introduce a novel image captioning dataset in Vietnamese, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC). The introduced dataset includes complex scenes captured in Vietnam and manually annotated by Vietnamese under strict rules and supervision. In this paper, we present in more detail the dataset creation process. From preliminary analysis, we show that our dataset is challenging to recent state-of-the-art (SOTA) Transformer-based baselines, which performed well on the MS COCO dataset. Then, the modest results prove that UIT-OpenViIC has room to grow, which can be one of the standard benchmarks in Vietnamese for the research community to evaluate their captioning models. Furthermore, we present a CAMO approach that effectively enhances the image representation ability by a multi-level encoder output fusion mechanism, which helps improve the quality of generated captions compared to previous captioning models.
Submitted 9 May, 2023; v1 submitted 6 May, 2023;
originally announced May 2023.
-
Class based Influence Functions for Error Detection
Authors:
Thang Nguyen-Duc,
Hoang Thanh-Tung,
Quan Hung Tran,
Dang Huu-Tien,
Hieu Ngoc Nguyen,
Anh T. V. Dau,
Nghi D. Q. Bui
Abstract:
Influence functions (IFs) are a powerful tool for detecting anomalous examples in large scale datasets. However, they are unstable when applied to deep networks. In this paper, we provide an explanation for the instability of IFs and develop a solution to this problem. We show that IFs are unreliable when the two data points belong to two different classes. Our solution leverages class information to improve the stability of IFs. Extensive experiments show that our modification significantly improves the performance and stability of IFs while incurring no additional computational cost.
Submitted 2 May, 2023;
originally announced May 2023.
-
CarGameAR: An Integrated AR Car Game Authoring Interface for Custom-Built Car Programed on Arduino Board
Authors:
Dang Bui,
Wanwan Li,
Hong Huang
Abstract:
In this paper, we present CarGameAR: An Integrated AR Car Game Authoring Interface for Custom-Built Car Programed on Arduino Board. The car consists of an Arduino board, an H-bridge, and motors. The objective of the project is to create a system that can move a car in different directions using a computer application. The system uses Unity software to create a virtual environment where the user can control the car using keyboard commands. The car's motion is achieved by sending signals from the computer to the Arduino board, which then drives the motors through the H-bridge. The project provides a cost-effective and efficient way to build a car, which can be used for educational purposes, such as teaching programming. Moreover, this project is not limited to the control of the car through keyboard commands in a virtual environment. The system can be adapted to support augmented reality (AR) technology, providing an even more immersive and engaging user experience. By integrating the car with AR, the user can control the car's motion using physical gestures and movements, adding an extra layer of interactivity to the system. This makes the car an ideal platform for game development in AR, allowing the user to create driving games that blend the physical and virtual worlds seamlessly. Additionally, the car's affordability and ease of construction make it an accessible and valuable tool for teaching programming and principles in a fun and interactive way. Overall, this project demonstrates the versatility and potential of the car system, highlighting the various applications and possibilities it offers for both education and entertainment.
Submitted 28 April, 2023;
originally announced May 2023.
-
Better Language Models of Code through Self-Improvement
Authors:
Hung Quoc To,
Nghi D. Q. Bui,
Jin Guo,
Tien N. Nguyen
Abstract:
Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the dataset provided. We aim to improve this issue by proposing a simple data augmentation framework. Our framework utilizes knowledge gained during the pre-training and fine-tuning stage to generate pseudo data, which is then used as training data for the next step. We incorporate this framework into the state-of-the-art language models, such as CodeT5, CodeBERT, and UnixCoder. The results show that our framework significantly improves PLMCs' performance in code-related sequence generation tasks, such as code summarization and code generation in the CodeXGLUE benchmark.
Submitted 9 May, 2023; v1 submitted 2 April, 2023;
originally announced April 2023.
-
Deployment of UAVs for Optimal Multihop Ad-hoc Networks Using Particle Swarm Optimization and Behavior-based Control
Authors:
Ngan Duong Thi Thuy,
Duy Nam Bui,
Manh Duong Phung,
Hung Pham Duy
Abstract:
This study proposes an approach for establishing an optimal multihop ad-hoc network using multiple unmanned aerial vehicles (UAVs) to provide emergency communication in disaster areas. The approach includes two stages, one uses particle swarm optimization (PSO) to find optimal positions to deploy UAVs, and the other uses a behavior-based controller to navigate the UAVs to their assigned positions without colliding with obstacles in an unknown environment. Several constraints related to the UAVs' sensing and communication ranges have been imposed to ensure the applicability of the proposed approach in real-world scenarios. A number of simulation experiments with data loaded from real environments have been conducted. The results show that our proposed approach is not only successful in establishing multihop ad-hoc routes but also meets the requirements for real-time deployment of UAVs.
Submitted 26 December, 2022;
originally announced December 2022.
-
Detect-Localize-Repair: A Unified Framework for Learning to Debug with CodeT5
Authors:
Nghi D. Q. Bui,
Yue Wang,
Steven Hoi
Abstract:
Automated software debugging is a crucial task for improving the productivity of software developers. Many neural-based techniques have been proven effective for debugging-related tasks such as bug localization and program repair (or bug fixing). However, these techniques often focus only on either one of them or approach them in a stage-wise manner, ignoring the mutual benefits between them. In this work, we propose a novel unified Detect-Localize-Repair framework based on a pretrained programming language model CodeT5 to seamlessly address these tasks, named CodeT5-DLR. Specifically, we propose three objectives to adapt the generic CodeT5 for debugging: a bug detection objective to determine whether a given code snippet is buggy or not, a bug localization objective to identify the buggy lines, and a program repair objective to translate the buggy code to its fixed version. We evaluate it on each of these tasks and their combined setting on two newly collected line-level debugging datasets in Java and Python. Extensive results show that our model significantly outperforms existing baselines from both NLP and software engineering domains.
Submitted 22 December, 2022; v1 submitted 27 November, 2022;
originally announced November 2022.
-
Optimal sizing of renewable energy storage: A comparative study of hydrogen and battery system considering degradation and seasonal storage
Authors:
Son Tay Le,
Tuan Ngoc Nguyen,
Dac-Khuong Bui,
Tuan Duc Ngo
Abstract:
Renewable energy storage (RES) is essential to address the intermittence issues of renewable energy systems, thereby enhancing the system stability and reliability. This study presents an optimisation study of sizing and operational strategy parameters of a grid-connected photovoltaic (PV)-hydrogen/battery systems using a Multi-Objective Modified Firefly Algorithm (MOMFA). An operational strategy that utilises the ability of hydrogen to store energy over a long time was also investigated. The proposed method was applied to a real-world distributed energy project located in the tropical climate zone. To further demonstrate the robustness and versatility of the method, another synthetic test case was examined for a location in the subtropical weather zone, which has a high seasonal mismatch. The performance of the proposed MOMFA method is compared with the NSGA-II method, which has been widely used to design renewable energy storage systems in the literature. The result shows that MOMFA is more accurate and robust than NSGA-II owing to the complex and dynamic nature of energy storage system. The optimisation results show that battery storage systems, as a mature technology, yield better economic performance than current hydrogen storage systems. However, it is proven that hydrogen storage systems provide better techno-economic performance and can be a viable long-term storage solution when high penetration of renewable energy is required. The study also proves that the proposed long-term operational strategy can lower component degradation, enhance efficiency, and increase the total economic performance of hydrogen storage systems. The findings of this study can support the implementation of energy storage systems for renewable energy.
Submitted 14 November, 2022;
originally announced November 2022.
-
SG-Shuffle: Multi-aspect Shuffle Transformer for Scene Graph Generation
Authors:
Anh Duc Bui,
Soyeon Caren Han,
Josiah Poon
Abstract:
Scene Graph Generation (SGG) serves as a comprehensive representation of images for human understanding as well as visual understanding tasks. Due to the long-tail bias problem of the object and predicate labels in the available annotated data, the scene graphs generated by current methodologies can be biased toward common, non-informative relationship labels. Relationships can sometimes be non-mutually exclusive, which means they can be described from multiple perspectives like geometrical relationships or semantic relationships, making it even more challenging to predict the most suitable relationship label. In this work, we propose the SG-Shuffle pipeline for scene graph generation with 3 components: 1) Parallel Transformer Encoder, which learns to predict object relationships in a more exclusive manner by grouping relationship labels into groups of similar purpose; 2) Shuffle Transformer, which learns to select the final relationship labels from the category-specific feature generated in the previous step; and 3) Weighted CE loss, used to alleviate the training bias caused by the imbalanced dataset.
Submitted 9 November, 2022;
originally announced November 2022.
-
Lyapunov-based Nonlinear Model Predictive Control for Attitude Trajectory Tracking of Unmanned Aerial Vehicles
Authors:
Duy Nam Bui,
Thi Thanh Van Nguyen,
Manh Duong Phung
Abstract:
This paper presents a new Lyapunov-based nonlinear model predictive controller (LNMPC) for the attitude control problem of unmanned aerial vehicles (UAVs), which is essential for their functioning operation. The controller is designed based on a quadratic cost function integrating UAV dynamics and system constraints. An additional contraction constraint is then introduced to ensure closed-loop system stability. That constraint is fulfilled via a Lyapunov function derived from a sliding mode controller (SMC). The feasibility and stability of the LNMPC are finally proved. Simulation and comparison results show that the proposed controller guarantees the system stability and outperforms other state-of-the-art nonlinear controllers such as the backstepping controller (BSC) and SMC. In addition, the proposed controller can be integrated into an existing UAV model in the Gazebo simulator to perform software-in-the-loop tests. The results show that the LNMPC is better than the built-in PID controller of the UAV, which confirms the validity and applicability of our proposed approach.
Submitted 28 October, 2022;
originally announced October 2022.
-
Depth Perspective-aware Multiple Object Tracking
Authors:
Kha Gia Quach,
Huu Le,
Pha Nguyen,
Chi Nhan Duong,
Tien Dai Bui,
Khoa Luu
Abstract:
This paper aims to tackle Multiple Object Tracking (MOT), an important problem in computer vision but remains challenging due to many practical issues, especially occlusions. Indeed, we propose a new real-time Depth Perspective-aware Multiple Object Tracking (DP-MOT) approach to tackle the occlusion problem in MOT. A simple yet efficient Subject-Ordered Depth Estimation (SODE) is first proposed to automatically order the depth positions of detected subjects in a 2D scene in an unsupervised manner. Using the output from SODE, a new Active pseudo-3D Kalman filter, a simple but effective extension of Kalman filter with dynamic control variables, is then proposed to dynamically update the movement of objects. In addition, a new high-order association approach is presented in the data association step to incorporate first-order and second-order relationships between the detected objects. The proposed approach consistently achieves state-of-the-art performance compared to recent MOT methods on standard MOT benchmarks.
Submitted 27 February, 2023; v1 submitted 10 July, 2022;
originally announced July 2022.
-
HierarchyNet: Learning to Summarize Source Code with Heterogeneous Representations
Authors:
Minh Huynh Nguyen,
Nghi D. Q. Bui,
Truong Son Hy,
Long Tran-Thanh,
Tien N. Nguyen
Abstract:
We propose a novel method for code summarization utilizing Heterogeneous Code Representations (HCRs) and our specially designed HierarchyNet. HCRs effectively capture essential code features at lexical, syntactic, and semantic levels by abstracting coarse-grained code elements and incorporating fine-grained program elements in a hierarchical structure. Our HierarchyNet method processes each layer of the HCR separately through a unique combination of the Heterogeneous Graph Transformer, a Tree-based CNN, and a Transformer Encoder. This approach preserves dependencies between code elements and captures relations through a novel Hierarchical-Aware Cross Attention layer. Our method surpasses current state-of-the-art techniques, such as PA-Former, CAST, and NeuralCodeSum.
Submitted 9 May, 2023; v1 submitted 30 May, 2022;
originally announced May 2022.
-
Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora
Authors:
Anh T. V. Dau,
Thang Nguyen-Duc,
Hoang Thanh-Tung,
Nghi D. Q. Bui
Abstract:
Despite the recent trend of developing and applying neural source code models to software engineering tasks, the quality of such models is insufficient for real-world use. This is because there could be noise in the source code corpora used to train such models. We adapt data-influence methods to detect such noises in this paper. Data-influence methods are used in machine learning to evaluate the similarity of a target sample to the correct samples in order to determine whether or not the target sample is noisy. Our evaluation results show that data-influence methods can identify noisy samples from neural code models in classification-based tasks. This approach will contribute to the larger vision of developing better neural source code models from a data-centric perspective, which is a key driver for developing useful source code models in practice.
Submitted 2 October, 2022; v1 submitted 25 May, 2022;
originally announced May 2022.
-
Partitioned Variational Inference: A Framework for Probabilistic Federated Learning
Authors:
Matthew Ashman,
Thang D. Bui,
Cuong V. Nguyen,
Stratis Markou,
Adrian Weller,
Siddharth Swaroop,
Richard E. Turner
Abstract:
The proliferation of computing devices has brought about an opportunity to deploy machine learning models on new problem domains using previously inaccessible data. Traditional algorithms for training such models often require data to be stored on a single machine with compute performed by a single node, making them unsuitable for decentralised training on multiple devices. This deficiency has motivated the development of federated learning algorithms, which allow multiple data owners to train collaboratively and use a shared model whilst keeping local data private. However, many of these algorithms focus on obtaining point estimates of model parameters, rather than probabilistic estimates capable of capturing model uncertainty, which is essential in many applications. Variational inference (VI) has become the method of choice for fitting many modern probabilistic models. In this paper we introduce partitioned variational inference (PVI), a general framework for performing VI in the federated setting. We develop new supporting theory for PVI, demonstrating a number of properties that make it an attractive choice for practitioners; use PVI to unify a wealth of fragmented, yet related literature; and provide empirical results that showcase the effectiveness of PVI in a variety of federated settings.
Submitted 28 April, 2022; v1 submitted 24 February, 2022;
originally announced February 2022.
-
Two New Algorithms for Line Clipping in E2 and Their Comparison
Authors:
Vaclav Skala,
Duc Huy Bui
Abstract:
Many algorithms for clipping a line by a rectangular area or a convex polygon in E2 or by a non-convex or convex polyhedron in E3 have been published. The line segment clipping by the rectangular window in E2 is often restricted to the use of the Cohen-Sutherland (CS) algorithm or its modifications based on some presumptions like small clipping window or more sophisticated coding technique, etc. The line clipping problem solution is a bottleneck of many packages and applications and, therefore, it would be desirable to use the fastest algorithm even though it is more complex.
Submitted 3 January, 2022;
originally announced January 2022.
-
A New Algorithm for Pyramidal Clipping of Line Segments in E3
Authors:
Vaclav Skala,
Duc Huy Bui
Abstract:
A new algorithm for clipping a line segment against a pyramid in E3 is presented. This algorithm avoids computation of intersection points which are not end-points of the output line segment. It also allows solving all cases more effectively. The performance of this algorithm is shown to be consistently better than existing algorithms, including the Cohen-Sutherland, Liang-Barsky and Cyrus-Beck algorithms.
Submitted 3 January, 2022;
originally announced January 2022.
-
Energy-bounded Learning for Robust Models of Code
Authors:
Nghi D. Q. Bui,
Yijun Yu
Abstract:
In programming, learning code representations has a variety of applications, including code classification, code search, comment generation, bug prediction, and so on. Various representations of code in terms of tokens, syntax trees, dependency graphs, code navigation paths, or a combination of their variants have been proposed. However, existing vanilla learning techniques have a major limitation in robustness, i.e., it is easy for the models to make incorrect predictions when the inputs are altered in a subtle way. To enhance the robustness, existing approaches focus on recognizing adversarial samples rather than on the valid samples that fall outside a given distribution, which we refer to as out-of-distribution (OOD) samples. Recognizing such OOD samples is the novel problem investigated in this paper. To this end, we propose to first augment the in-distribution datasets with out-of-distribution samples such that, when trained together, they will enhance the model's robustness. We propose the use of an energy-bounded learning objective function to assign a higher score to in-distribution samples and a lower score to out-of-distribution samples in order to incorporate such out-of-distribution samples into the training process of source code models. In terms of OOD detection and adversarial samples detection, our evaluation results demonstrate a greater robustness for existing source code models to become more accurate at recognizing OOD data while being more resistant to adversarial attacks at the same time. Furthermore, the proposed energy-bounded score outperforms all existing OOD detection scores by a large margin, including the softmax confidence score, the Mahalanobis score, and ODIN.
Submitted 9 May, 2022; v1 submitted 20 December, 2021;
originally announced December 2021.
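The abstract above does not give the form of the energy-bounded score. As a rough illustration only, the sketch below assumes the commonly used energy formulation, in which the score is the temperature-scaled log-sum-exp of a classifier's logits, so that confident (in-distribution) inputs score higher than flat (OOD-like) ones. The function name `energy_score`, the `temperature` parameter, and the example logits are hypothetical and are not taken from the paper.

```python
import math

def energy_score(logits, temperature=1.0):
    """Negative free energy: T * log(sum(exp(logit / T))).

    Assumed formulation for illustration; higher means "more in-distribution".
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    return temperature * (peak + math.log(sum(math.exp(x - peak) for x in scaled)))

# Hypothetical logits: one confident prediction, one nearly flat prediction.
confident_logits = [10.0, 0.0, 0.0]
flat_logits = [0.1, 0.0, 0.1]

in_dist_score = energy_score(confident_logits)
ood_like_score = energy_score(flat_logits)
```

Under this assumption, a threshold on the score would separate the two cases; how the paper bounds the energy during training is not specified in the abstract.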
-
Monte Carlo calculation of the organ equivalent dose and effective dose due to immersion in a 16N beta source in air using the ICRP Reference Phantoms
Authors:
Jose M. Gomez-Ros,
Montserrat Moraleda,
Pedro Arce,
Duc-Ky Bui,
Thi-My-Linh Dang,
Laurent Desorgher,
Han Sung Kim,
Dragana Krstic,
Michal Kuc,
Ngoc-Thiem Le,
Yi-Kang Lee,
Ngoc-Quynh Nguyen,
Dragoslav Nikezic,
Katarzyna Tyminska,
Tomas Vrba
Abstract:
This work summarises the results of a comparison organized by EURADOS focused on the usage of the ICRP Reference Computational Phantoms. This activity aimed to provide training for the implementation of voxel phantoms in Monte Carlo radiation transport codes and the calculation of the dose equivalent in organs and the effective dose. This particular case describes a scenario of immersion in a 16N beta source distributed in the air of a room with concrete walls where the phantom is located. Seven participants took part in the comparison of results using GEANT4, TRIPOLI-4 and MCNP family codes, and a general problem was detected when calculating the dose to skeletal tissue and the remainder tissue. After a process of feedback with the participants, the errors were corrected and the final results reached an agreement of +/-5%.
Submitted 7 December, 2021;
originally announced December 2021.
-
InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees
Authors:
Nghi D. Q. Bui,
Yijun Yu,
Lingxiao Jiang
Abstract:
Building deep learning models on source code has found many successful software engineering applications, such as code search, code comment generation, bug detection, code migration, and so on. Current learning techniques, however, have a major drawback: these models are mostly trained on datasets labeled for particular downstream tasks, and the resulting code representations may not be suitable for other tasks. While some techniques produce representations from unlabeled code, they are far from satisfactory when applied to downstream tasks. This paper proposes InferCode to overcome the limitation by adapting the self-supervised learning mechanism to build a source code model. The key novelty lies in training code representations by predicting automatically identified subtrees from the context of the ASTs. Subtrees in ASTs are treated with InferCode as the labels for training code representations without any human labeling effort or the overhead of expensive graph construction, and the trained representations are no longer tied to any specific downstream tasks or code units. We trained an InferCode model instance using the Tree-based CNN as the encoder on a large set of Java code and applied it to downstream unsupervised tasks such as code clustering, code clone detection, and cross-language code search, or reused it under a transfer learning scheme to continue training the model weights for supervised tasks such as code classification and method name prediction. Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, and ASTNN, our pre-trained InferCode model achieves higher performance by a significant margin for most tasks, including those involving different programming languages.
Submitted 15 December, 2020; v1 submitted 13 December, 2020;
originally announced December 2020.
-
Offset Curves Loss for Imbalanced Problem in Medical Segmentation
Authors:
Ngan Le,
Trung Le,
Kashu Yamazaki,
Toan Duc Bui,
Khoa Luu,
Marios Savides
Abstract:
Medical image segmentation has played an important role in medical analysis and has been widely developed for many clinical applications. Deep learning-based approaches have achieved high performance in semantic segmentation, but they are limited to the pixel-wise setting and suffer from the imbalanced-class data problem. In this paper, we tackle those limitations by developing a new deep learning-based model which takes into account the higher feature level, i.e. the region inside the contour; the intermediate feature level, i.e. offset curves around the contour; and the lower feature level, i.e. the contour. Our proposed Offset Curves (OsC) loss consists of three main fitting terms. The first fitting term focuses on pixel-wise level segmentation, whereas the second fitting term acts as an attention model which pays attention to the area around the boundaries (offset curves). The third term plays the role of a regularization term which takes the length of boundaries into account. We evaluate our proposed OsC loss on both 2D and 3D networks. Two common medical datasets, i.e. the retina DRIVE and brain tumor BRATS 2018 datasets, are used to benchmark the performance of our proposed loss. The experiments have shown that our proposed OsC loss function outperforms other mainstream loss functions such as Cross-Entropy, Dice, and Focal on the most common segmentation networks, Unet and FCN.
Submitted 4 December, 2020;
originally announced December 2020.
-
Flow-based Deformation Guidance for Unpaired Multi-Contrast MRI Image-to-Image Translation
Authors:
Toan Duc Bui,
Manh Nguyen,
Ngan Le,
Khoa Luu
Abstract:
Image synthesis from corrupted contrasts increases the diversity of diagnostic information available for many neurological diseases. Recently, image-to-image translation has experienced significant levels of interest within medical research, beginning with the successful use of the Generative Adversarial Network (GAN) and continuing to the introduction of a cyclic constraint extended to multiple domains. However, in current approaches, there is no guarantee that the mapping between the two image domains would be unique or one-to-one. In this paper, we introduce a novel approach to unpaired image-to-image translation based on the invertible architecture. The invertible property of the flow-based architecture assures cycle-consistency of image-to-image translation without additional loss functions. We utilize the temporal information between consecutive slices to provide more constraints to the optimization for transforming one domain to another in unpaired volumetric medical images. To capture temporal structures in the medical images, we explore the displacement between the consecutive slices using a deformation field. In our approach, the deformation field is used as guidance to keep the translated slices realistic and consistent across the translation. The experimental results have shown that the synthesized images using our proposed approach are able to achieve competitive performance in terms of mean squared error, peak signal-to-noise ratio, and structural similarity index when compared with the existing deep learning-based methods on three standard datasets, i.e. HCP, MRBrainS13, and Brats2019.
Submitted 3 December, 2020;
originally announced December 2020.
-
TreeCaps: Tree-Based Capsule Networks for Source Code Processing
Authors:
Nghi D. Q. Bui,
Yijun Yu,
Lingxiao Jiang
Abstract:
Recently, program learning techniques have been proposed to process source code based on syntactical structures (e.g., Abstract Syntax Trees) and/or semantic information (e.g., Dependency Graphs). Although graphs may be better at capturing various viewpoints of code semantics than trees, constructing graph inputs from code needs static code semantic analysis that may not be accurate and introduces noise during learning. Although syntax trees are precisely defined according to the language grammar and easier to construct and process than graphs, previous tree-based learning techniques have not been able to learn semantic information from trees to achieve better accuracy than graph-based techniques. We propose a new learning technique, named TreeCaps, by fusing together capsule networks with tree-based convolutional neural networks, to achieve learning accuracy higher than existing graph-based techniques while being based only on trees. TreeCaps introduces novel variable-to-static routing algorithms into the capsule networks to compensate for the loss of previous routing algorithms. Aside from accuracy, we also find that TreeCaps is the most robust in withstanding semantic-preserving program transformations that change code syntax without modifying the semantics. Evaluated on a large number of Java and C/C++ programs, TreeCaps models outperform prior deep learning models of program source code, in terms of both accuracy and robustness, for program comprehension tasks such as code functionality classification and function name prediction.
Submitted 14 December, 2020; v1 submitted 5 September, 2020;
originally announced September 2020.
-
Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations
Authors:
Nghi D. Q. Bui,
Yijun Yu,
Lingxiao Jiang
Abstract:
We propose Corder, a self-supervised contrastive learning framework for source code models. Corder is designed to alleviate the need for labeled data in code retrieval and code summarization tasks. The pre-trained model of Corder can be used in two ways: (1) it can produce vector representations of code which can be applied to code retrieval tasks that do not have labeled data; (2) it can be used in a fine-tuning process for tasks that might still require labeled data, such as code summarization. The key innovation is that we train the source code model by asking it to recognize similar and dissimilar code snippets through a contrastive learning objective. To do so, we use a set of semantic-preserving transformation operators to generate code snippets that are syntactically diverse but semantically equivalent. Through extensive experiments, we have shown that the code models pretrained by Corder substantially outperform the other baselines for code-to-code retrieval, text-to-code retrieval, and code-to-text summarization tasks.
Submitted 23 May, 2021; v1 submitted 6 September, 2020;
originally announced September 2020.
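The abstract above says only that the model is trained with "a contrastive learning objective" and does not name the loss. As an illustration, the sketch below assumes an InfoNCE-style loss over cosine similarities: the loss is small when an anchor snippet's embedding is closer to its transformed (semantically equivalent) variant than to unrelated snippets. The names `cosine`, `info_nce`, the `temperature` value, and the two-dimensional toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): -log softmax of the positive pair."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # stabilise the log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]

# Toy embeddings: the anchor, a close variant, and two unrelated snippets.
anchor = [1.0, 0.0]
close_variant = [0.9, 0.1]
unrelated = [[0.0, 1.0], [-1.0, 0.0]]

loss_when_aligned = info_nce(anchor, close_variant, unrelated)
loss_when_misaligned = info_nce(anchor, [0.0, 1.0], unrelated)
```

Minimising such a loss pulls equivalent snippets together in embedding space, which is the property the retrieval use case relies on.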
-
Improving Text to Image Generation using Mode-seeking Function
Authors:
Naitik Bhise,
Zhenfei Zhang,
Tien D. Bui
Abstract:
Generative Adversarial Networks (GANs) have long been used to understand the semantic relationship between text and images. However, image generation suffers from mode collapse, in which the generator favors a few preferred output modes. Our aim is to improve the training of the network by using a specialized mode-seeking loss function to avoid this issue. In text-to-image synthesis, our loss function differentiates two points in latent space for the generation of distinct images. We validate our model on the Caltech Birds (CUB) dataset and the Microsoft COCO dataset by changing the intensity of the loss function during training. Experimental results demonstrate that our model works very well compared to some state-of-the-art approaches.
Submitted 18 September, 2020; v1 submitted 19 August, 2020;
originally announced August 2020.
-
Optimizing fire allocation in a NCW-type model
Authors:
Nam Hong Nguyen,
My Anh Vu,
Dinh Van Bui,
Anh Ngoc Ta,
Manh Duc Hy
Abstract:
In this paper, we introduce a non-linear Lanchester model of NCW-type and investigate an optimization problem for this model, where only the Red force is supplied by several supply agents. Optimal fire allocation of the Blue force is sought in the form of a piece-wise constant function of time. A threatening rate is computed for the Red force and each of its supply agents at the beginning of each stage of the combat. These rates can be used to derive the optimal decision for the Blue force: whether to focus its firepower on the Red force itself or on one of its supply agents. This optimal fire allocation is derived and proved by considering an optimization problem for the number of Blue force troops. Numerical experiments are included to demonstrate the theoretical results.
Submitted 12 August, 2020;
originally announced August 2020.
-
On the Generalizability of Neural Program Models with respect to Semantic-Preserving Program Transformations
Authors:
Md Rafiqul Islam Rabin,
Nghi D. Q. Bui,
Ke Wang,
Yijun Yu,
Lingxiao Jiang,
Mohammad Amin Alipour
Abstract:
With the prevalence of publicly available source code repositories to train deep neural network models, neural program models can do well in source code analysis tasks, such as predicting method names in given programs, that cannot be easily done by traditional program analysis techniques. Although such neural program models have been tested on various existing datasets, the extent to which they generalize to unforeseen source code is largely unknown. Since it is very challenging to test neural program models on all unforeseen programs, in this paper, we propose to evaluate the generalizability of neural program models with respect to semantic-preserving transformations: a generalizable neural program model should perform equally well on programs that are of the same semantics but of different lexical appearances and syntactical structures. We compare the results of various neural program models for the method name prediction task on programs before and after automated semantic-preserving transformations. We use three Java datasets of different sizes and three state-of-the-art neural network models for code, namely code2vec, code2seq, and GGNN, to build nine such neural program models for evaluation. Our results show that even with small semantically preserving changes to the programs, these neural program models often fail to generalize their performance. Our results also suggest that neural program models based on data and control dependencies in programs generalize better than neural program models based only on abstract syntax trees. On the positive side, we observe that as the size of the training dataset grows and diversifies, the generalizability of correct predictions produced by the neural program models can be improved too. Our results on the generalizability of neural program models provide insights to measure their limitations and provide a stepping stone for their improvement.
Submitted 18 March, 2021; v1 submitted 31 July, 2020;
originally announced August 2020.
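To make the notion of a semantic-preserving transformation in the abstract above concrete, the sketch below renames a local variable, which changes the lexical appearance of a program without changing its behaviour. The paper works on Java programs with its own tooling; this example instead uses Python's standard `ast` module (Python 3.9+ for `ast.unparse`), and the `RenameVariable` class, the sample function, and the chosen names are hypothetical illustrations.

```python
import ast

class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier (illustrative helper)."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        # Variable reads and writes are Name nodes.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        # Function parameters are arg nodes.
        if node.arg == self.old:
            node.arg = self.new
        return node

SOURCE = (
    "def total(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)

def transform(source, old, new):
    """Parse, rename, and turn the tree back into source text."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)

def run_total(source, data):
    """Execute the source and call its `total` function on `data`."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](data)

TRANSFORMED = transform(SOURCE, "acc", "running_sum")
```

A model that generalizes in the sense described above would be expected to predict the same method name for `SOURCE` and `TRANSFORMED`, since both compute the same result on every input.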
-
Multi-Site Infant Brain Segmentation Algorithms: The iSeg-2019 Challenge
Authors:
Yue Sun,
Kun Gao,
Zhengwang Wu,
Zhihao Lei,
Ying Wei,
Jun Ma,
** Yang,
Xue Feng,
Li Zhao,
Trung Le Phan,
Jitae Shin,
Tao Zhong,
Yu Zhang,
Lequan Yu,
Caizi Li,
Ramesh Basnet,
M. Omair Ahmad,
M. N. S. Swamy,
Wenao Ma,
Qi Dou,
Toan Duc Bui,
Camilo Bermudez Noguera,
Bennett Landman,
Ian H. Gotlib,
Kathryn L. Humphreys
, et al. (8 additional authors not shown)
Abstract:
To better understand early brain growth patterns in health and disorder, it is critical to accurately segment infant brain magnetic resonance (MR) images into white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF). Deep learning-based methods have achieved state-of-the-art performance; however, one of the major limitations is that learning-based methods may suffer from the multi-site issue, that is, models trained on a dataset from one site may not be applicable to datasets acquired from other sites with different imaging protocols/scanners. To promote methodological development in the community, the iSeg-2019 challenge (http://iseg2019.web.unc.edu) provides a set of 6-month infant subjects from multiple sites with different protocols/scanners for the participating methods. Training/validation subjects are from UNC (MAP) and testing subjects are from UNC/UMN (BCP), Stanford University, and Emory University. At the time of writing, there are 30 automatic segmentation methods participating in iSeg-2019. We review the 8 top-ranked teams by detailing their pipelines/implementations, presenting experimental results, and evaluating performance in terms of the whole brain, regions of interest, and gyral landmark curves. We also discuss their limitations and possible future directions for the multi-site issue. We hope that the multi-site dataset in iSeg-2019 and this review article will attract more researchers to the multi-site issue.
Submitted 11 July, 2020; v1 submitted 4 July, 2020;
originally announced July 2020.
-
Variational Auto-Regressive Gaussian Processes for Continual Learning
Authors:
Sanyam Kapoor,
Theofanis Karaletsos,
Thang D. Bui
Abstract:
Through sequential construction of posteriors on observing data online, Bayes' theorem provides a natural framework for continual learning. We develop Variational Auto-Regressive Gaussian Processes (VAR-GPs), a principled posterior updating mechanism to solve sequential tasks in continual learning. By relying on sparse inducing point approximations for scalable posteriors, we propose a novel auto-regressive variational distribution which reveals two fruitful connections to existing results in Bayesian inference, expectation propagation and orthogonal inducing points. Mean predictive entropy estimates show VAR-GPs prevent catastrophic forgetting, which is empirically supported by strong performance on modern continual learning benchmarks against competitive baselines. A thorough ablation study demonstrates the efficacy of our modeling choices.
Submitted 12 June, 2021; v1 submitted 9 June, 2020;
originally announced June 2020.
-
LIAAD: Lightweight Attentive Angular Distillation for Large-scale Age-Invariant Face Recognition
Authors:
Thanh-Dat Truong,
Chi Nhan Duong,
Kha Gia Quach,
Ngan Le,
Tien D. Bui,
Khoa Luu
Abstract:
Disentangled representations have been commonly adopted for Age-invariant Face Recognition (AiFR) tasks. However, these methods have some limitations: (1) they require large-scale face recognition (FR) training data with age labels, which is limited in practice; (2) they rely on heavy deep network architectures for high performance; and (3) their evaluations usually take place on age-related face databases while neglecting the standard large-scale FR databases needed to guarantee robustness. This work presents a novel Lightweight Attentive Angular Distillation (LIAAD) approach to Large-scale Lightweight AiFR that overcomes these limitations. Given two high-performance heavy networks as teachers with different specialized knowledge, LIAAD introduces a learning paradigm to efficiently distill the age-invariant attentive and angular knowledge from those teachers to a lightweight student network, making it more powerful, with higher FR accuracy and robustness against the age factor. Consequently, the LIAAD approach is able to take advantage of FR datasets both with and without age labels to train an AiFR model. Unlike prior distillation methods, which mainly focus on accuracy and compression ratios in closed-set problems, our LIAAD aims to solve the open-set problem, i.e. large-scale face recognition. Evaluations on LFW, IJB-B and IJB-C Janus, AgeDB, and MegaFace-FGNet with one million distractors have demonstrated the efficiency of the proposed approach on a lightweight structure. This work also presents a new longitudinal face aging (LogiFace) database (this database will be made available) for further studies of age-related facial problems in the future.
Submitted 11 September, 2022; v1 submitted 8 April, 2020;
originally announced April 2020.
-
Hierarchical Gaussian Process Priors for Bayesian Neural Network Weights
Authors:
Theofanis Karaletsos,
Thang D. Bui
Abstract:
Probabilistic neural networks are typically modeled with independent weight priors, which do not capture weight correlations in the prior and do not provide a parsimonious interface to express properties in function space. A desirable class of priors would represent weights compactly, capture correlations between weights, facilitate calibrated reasoning about uncertainty, and allow inclusion of prior knowledge about the function space such as periodicity or dependence on contexts such as inputs. To this end, this paper introduces two innovations: (i) a Gaussian process-based hierarchical model for network weights based on unit embeddings that can flexibly encode correlated weight structures, and (ii) input-dependent versions of these weight priors that can provide convenient ways to regularize the function space through the use of kernels defined on contextual inputs. We show these models provide desirable test-time uncertainty estimates on out-of-distribution data, demonstrate cases of modeling inductive biases for neural networks with kernels which help both interpolation and extrapolation from training data, and demonstrate competitive predictive performance on an active learning benchmark.
Submitted 10 February, 2020;
originally announced February 2020.
-
A Fast Template-based Approach to Automatically Identify Primary Text Content of a Web Page
Authors:
Dat Quoc Nguyen,
Dai Quoc Nguyen,
Son Bao Pham,
The Duy Bui
Abstract:
Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant Web pages. One reason is that search engines also look at non-informative blocks of Web pages such as advertisements, navigation links, etc. In this paper, we propose a fast algorithm called FastContentExtractor to automatically detect main content blocks in a Web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates representing the structure of content blocks in a website, content blocks of a new Web page from the website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks are in the same order as the original ones.
Submitted 26 November, 2019;
originally announced November 2019.