-
Learning to Optimise Wind Farms with Graph Transformers
Authors:
Siyi Li,
Arnaud Robert,
A. Aldo Faisal,
Matthew D. Piggott
Abstract:
This work proposes a novel data-driven model capable of providing accurate predictions for the power generation of all wind turbines in wind farms of arbitrary layout, yaw angle configurations and wind conditions. The proposed model functions by encoding a wind farm into a fully-connected graph and processing the graph representation through a graph transformer. The graph transformer surrogate is shown to generalise well and is able to uncover latent structural patterns within the graph representation of wind farms. It is demonstrated how the resulting surrogate model can be used to optimise yaw angle configurations using genetic algorithms, achieving similar levels of accuracy to industry-standard wind farm simulation tools at a fraction of the computational cost.
Submitted 21 November, 2023;
originally announced November 2023.
-
Gaussian process deconvolution
Authors:
Felipe Tobar,
Arnaud Robert,
Jorge F. Silva
Abstract:
Let us consider the deconvolution problem, that is, to recover a latent source $x(\cdot)$ from the observations $\mathbf{y} = [y_1,\ldots,y_N]$ of a convolution process $y = x\star h + \eta$, where $\eta$ is an additive noise, the observations in $\mathbf{y}$ might have missing parts with respect to $y$, and the filter $h$ could be unknown. We propose a novel strategy to address this task when $x$ is a continuous-time signal: we adopt a Gaussian process (GP) prior on the source $x$, which allows for closed-form Bayesian nonparametric deconvolution. We first analyse the direct model to establish the conditions under which the model is well defined. Then, we turn to the inverse problem, where we study i) some necessary conditions under which Bayesian deconvolution is feasible, and ii) to what extent the filter $h$ can be learnt from data or approximated for the blind deconvolution case. The proposed approach, termed Gaussian process deconvolution (GPDC), is compared to other deconvolution methods conceptually, via illustrative examples, and using real-world datasets.
Submitted 8 May, 2023; v1 submitted 8 May, 2023;
originally announced May 2023.
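The closed-form posterior mentioned in the abstract above can be illustrated on a finite grid. The following is a minimal sketch, not the authors' GPDC implementation: it assumes a discretised signal, a squared-exponential prior, a known Gaussian filter $h$, and hypothetical function names (`rbf_kernel`, `gaussian_filter_matrix`, `gp_deconvolve`). With $\mathbf{y} = H\mathbf{x} + \eta$, the posterior mean is $K H^\top (H K H^\top + \sigma^2 I)^{-1}\mathbf{y}$, and missing observations are handled by keeping only the observed rows of $H$.

```python
import numpy as np


def rbf_kernel(t, lengthscale=0.1, variance=1.0):
    """Squared-exponential covariance on the grid t (an assumed prior)."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)


def gaussian_filter_matrix(t, width=0.05):
    """Discretised convolution with a Gaussian filter h (an assumed filter).

    Row i approximates the integral of h(t_i - s) x(s) ds by a Riemann sum.
    """
    dt = t[1] - t[0]
    d = t[:, None] - t[None, :]
    h = np.exp(-0.5 * (d / width) ** 2) / (width * np.sqrt(2.0 * np.pi))
    return h * dt


def gp_deconvolve(y, obs_idx, t, noise_std, lengthscale=0.1, width=0.05):
    """Posterior mean and covariance of the source x given noisy y = H x + eta."""
    K = rbf_kernel(t, lengthscale)
    H = gaussian_filter_matrix(t, width)[obs_idx]  # observed rows only
    S = H @ K @ H.T + noise_std ** 2 * np.eye(len(obs_idx))
    A = np.linalg.solve(S, H @ K)  # S^{-1} H K
    mean = A.T @ y                 # K H^T S^{-1} y (S and K are symmetric)
    cov = K - K @ H.T @ A          # K - K H^T S^{-1} H K
    return mean, cov


# Toy data: a sine source, blurred, with every other sample missing.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 100)
x_true = np.sin(2.0 * np.pi * t)
obs_idx = np.arange(0, t.size, 2)
noise_std = 0.01
y = gaussian_filter_matrix(t)[obs_idx] @ x_true
y = y + noise_std * rng.standard_normal(obs_idx.size)

mean, cov = gp_deconvolve(y, obs_idx, t, noise_std)
rmse = np.sqrt(np.mean((mean - x_true) ** 2))
```

The blind case described in the abstract, where $h$ is unknown, is not covered by this sketch.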
-
ImmunoLingo: Linguistics-based formalization of the antibody language
Authors:
Mai Ha Vu,
Philippe A. Robert,
Rahmad Akbar,
Bartlomiej Swiatczak,
Geir Kjetil Sandve,
Dag Trygve Truslew Haug,
Victor Greiff
Abstract:
Apparent parallels between natural language and biological sequences have led to a recent surge in the application of deep language models (LMs) to the analysis of antibody and other biological sequences. However, the lack of a rigorous linguistic formalization of biological sequence languages, which would define basic components, such as lexicon (i.e., the discrete units of the language) and grammar (i.e., the rules that link sequence well-formedness, structure, and meaning), has led to largely domain-unspecific applications of LMs, which do not take into account the underlying structure of the biological sequences studied. A linguistic formalization, on the other hand, establishes linguistically-informed and thus domain-adapted components for LM applications. It would facilitate a better understanding of how differences and similarities between natural language and biological sequences influence the quality of LMs, which is crucial for the design of interpretable models with extractable sequence-function relationship rules, such as the ones underlying the antibody specificity prediction problem. Deciphering the rules of antibody specificity is crucial to accelerating rational and in silico biotherapeutic drug design. Here, we formalize the properties of the antibody language and thereby establish a foundation not only for the application of linguistic tools in adaptive immune receptor analysis but also for the systematic immunolinguistic study of immune receptor specificity in general.
Submitted 29 November, 2022; v1 submitted 26 September, 2022;
originally announced September 2022.
-
Linguistically inspired roadmap for building biologically reliable protein language models
Authors:
Mai Ha Vu,
Rahmad Akbar,
Philippe A. Robert,
Bartlomiej Swiatczak,
Victor Greiff,
Geir Kjetil Sandve,
Dag Trygve Truslew Haug
Abstract:
Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared to natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Incorporating linguistic ideas into protein LMs enables the development of next-generation interpretable machine-learning models with the potential of uncovering the biological mechanisms underlying sequence-function relationships.
Submitted 28 April, 2023; v1 submitted 3 July, 2022;
originally announced July 2022.
-
AntBO: Towards Real-World Automated Antibody Design with Combinatorial Bayesian Optimisation
Authors:
Asif Khan,
Alexander I. Cowen-Rivers,
Antoine Grosnit,
Derrick-Goh-Xin Deik,
Philippe A. Robert,
Victor Greiff,
Eva Smorodina,
Puneet Rawat,
Kamil Dreczkowski,
Rahmad Akbar,
Rasul Tutunov,
Dany Bou-Ammar,
Jun Wang,
Amos Storkey,
Haitham Bou-Ammar
Abstract:
Antibodies are canonically Y-shaped multimeric proteins capable of highly specific molecular recognition. The CDRH3 region located at the tip of variable chains of an antibody dominates antigen-binding specificity. Therefore, it is a priority to design optimal antigen-specific CDRH3 regions to develop therapeutic antibodies. However, the combinatorial nature of CDRH3 sequence space makes it impossible to search for an optimal binding sequence exhaustively and efficiently using computational approaches. Here, we present \texttt{AntBO}: a combinatorial Bayesian optimisation framework enabling efficient \textit{in silico} design of the CDRH3 region. Ideally, antibodies are expected to have high target specificity and developability. We introduce a CDRH3 trust region that restricts the search to sequences with favourable developability scores to achieve this goal. For benchmarking, \texttt{AntBO} uses the \texttt{Absolut!} software suite as a black-box oracle to score the target specificity and affinity of designed antibodies \textit{in silico} in an unconstrained fashion~\citep{robert2021one}. The experiments performed for $159$ discretised antigens used in \texttt{Absolut!} demonstrate the benefit of \texttt{AntBO} in designing CDRH3 regions with diverse biophysical properties. In under $200$ calls to the black-box oracle, \texttt{AntBO} can suggest antibody sequences that outperform the best binding sequence drawn from 6.9 million experimentally obtained CDRH3s and a commonly used genetic algorithm baseline. Additionally, \texttt{AntBO} finds very-high-affinity CDRH3 sequences in only 38 protein designs whilst requiring no domain knowledge. We conclude that \texttt{AntBO} brings automated antibody design methods closer to what is practically viable for in vitro experimentation.
Submitted 14 October, 2022; v1 submitted 29 January, 2022;
originally announced January 2022.
-
The Wasserstein-Fourier Distance for Stationary Time Series
Authors:
Elsa Cazelles,
Arnaud Robert,
Felipe Tobar
Abstract:
We propose the Wasserstein-Fourier (WF) distance to measure the (dis)similarity between time series by quantifying the displacement of their energy across frequencies. The WF distance operates by calculating the Wasserstein distance between the (normalised) power spectral densities (NPSD) of time series. Although this rationale has been considered in the past, we fill a gap in the open literature by providing a formal introduction of this distance, together with its main properties from the joint perspective of Fourier analysis and optimal transport. As the main aim of this work is to validate WF as a general-purpose metric for time series, we illustrate its applicability in three broad contexts. First, we rely on WF to implement a PCA-like dimensionality reduction for NPSDs which allows for meaningful visualisation and pattern recognition applications. Second, we show that the geometry induced by WF on the space of NPSDs admits a geodesic interpolant between time series, thus enabling data augmentation in the spectral domain by averaging the dynamic content of two signals. Third, we implement WF for time series classification using parametric/non-parametric classifiers and compare it to other classical metrics. Supported by theoretical results, as well as synthetic illustrations and experiments on real-world data, this work establishes WF as a meaningful and capable resource pertinent to general distance-based applications of time series.
Submitted 11 December, 2020; v1 submitted 11 December, 2019;
originally announced December 2019.
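The construction described in the abstract above (a Wasserstein distance between normalised power spectral densities) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the NPSD is estimated with a plain one-sided periodogram, the order $p = 2$ is assumed as the default, and the function names `npsd`, `wasserstein_1d` and `wf_distance` are hypothetical. In one dimension the Wasserstein-$p$ distance equals the $L^p$ distance between quantile functions, which is what the code integrates exactly for discrete distributions.

```python
import numpy as np


def npsd(x, fs=1.0):
    """One-sided periodogram of x, normalised to sum to one."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    total = power.sum()
    if total == 0.0:
        raise ValueError("signal has no energy after mean removal")
    return freqs, power / total


def wasserstein_1d(support_a, weights_a, support_b, weights_b, p=2):
    """Wasserstein-p distance between two discrete 1-D distributions.

    Supports must be sorted in increasing order; weights must sum to one.
    """
    ca = np.minimum(np.cumsum(weights_a), 1.0)
    cb = np.minimum(np.cumsum(weights_b), 1.0)
    ca[-1] = cb[-1] = 1.0  # guard against rounding in the cumulative sums
    # Both quantile functions are constant between consecutive CDF levels.
    levels = np.unique(np.concatenate(([0.0], ca, cb)))
    mids = 0.5 * (levels[:-1] + levels[1:])
    widths = np.diff(levels)
    qa = support_a[np.searchsorted(ca, mids, side="left")]
    qb = support_b[np.searchsorted(cb, mids, side="left")]
    return float(np.sum(widths * np.abs(qa - qb) ** p) ** (1.0 / p))


def wf_distance(x, y, fs=1.0, p=2):
    """Wasserstein distance between the NPSDs of two time series."""
    fx, px = npsd(x, fs)
    fy, py = npsd(y, fs)
    return wasserstein_1d(fx, px, fy, py, p)


# Two pure tones: their spectral mass sits at 5 Hz and 10 Hz respectively,
# so the distance should be close to the 5 Hz gap between them.
fs = 100.0
t = np.arange(1000) / fs
x = np.sin(2.0 * np.pi * 5.0 * t)
y = np.sin(2.0 * np.pi * 10.0 * t)
d_xy = wf_distance(x, y, fs=fs)
d_xx = wf_distance(x, x, fs=fs)
```

Because the distance is computed on normalised spectra, it ignores amplitude scaling and phase; two signals that differ only in those respects are at distance zero.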
-
Hierarchical QR factorization algorithms for multi-core cluster systems
Authors:
Jack Dongarra,
Mathieu Faverge,
Thomas Herault,
Julien Langou,
Yves Robert
Abstract:
This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms represent the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms, which naturally enable good data locality for the sequential kernels executed by the cores (high sequential performance), a low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism).
Submitted 7 October, 2011;
originally announced October 2011.
-
Program slicing techniques and its applications
Authors:
N. Sasirekha,
A. Edwin Robert,
Dr. M. Hemalatha
Abstract:
Program understanding is an important aspect of software maintenance and reengineering. Understanding a program is related to its execution behaviour and the relationships among the variables involved in the program. The set of all statements in a program that directly or indirectly influence the value of a variable at some point in the program is called a program slice. Program slicing is a technique for extracting parts of computer programs by tracing the programs' control and data flow related to some data item. This technique is applicable in various areas such as debugging, program comprehension and understanding, program integration, cohesion measurement, re-engineering, maintenance, and testing, where it is useful to be able to focus on relevant parts of large programs. This paper focuses on various slicing techniques, including (but not limited to) static slicing, quasi-static slicing, dynamic slicing and conditional slicing. It also covers various methods of performing the slicing, such as forward slicing, backward slicing, syntactic slicing and semantic slicing. The slicing of a program is carried out using Java, which is an object-oriented programming language.
Submitted 5 August, 2011;
originally announced August 2011.
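The notion of a backward static slice from the abstract above can be illustrated with a small sketch. It is a deliberate simplification rather than the paper's method: the paper works in Java, whereas this example uses Python, handles only straight-line code (data flow, no control flow), and models each statement as a hypothetical `(defined_variable, used_variables)` pair. Walking the program backwards, a statement joins the slice when it defines a variable that is currently relevant to the slicing criterion.

```python
def backward_slice(statements, criterion_var):
    """Indices of statements that can affect criterion_var at program end.

    statements: list of (defined_variable, set_of_used_variables) pairs,
    in execution order. Straight-line code only (an assumed restriction).
    """
    relevant = {criterion_var}
    in_slice = []
    for idx in range(len(statements) - 1, -1, -1):
        target, used = statements[idx]
        if target in relevant:
            in_slice.append(idx)
            # This definition satisfies the need for `target`; what matters
            # before this point is whatever the statement itself reads.
            relevant.discard(target)
            relevant |= set(used)
    return sorted(in_slice)


# Toy program, one tuple per statement:
#   0: n = read()
#   1: total = 0
#   2: prod = 1
#   3: total = total + n
#   4: prod = prod * n
program = [
    ("n", set()),
    ("total", set()),
    ("prod", set()),
    ("total", {"total", "n"}),
    ("prod", {"prod", "n"}),
]

slice_total = backward_slice(program, "total")  # statements 0, 1 and 3
slice_prod = backward_slice(program, "prod")    # statements 0, 2 and 4
```

A forward slice would instead start from a definition and collect the statements that read its value, directly or transitively.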
-
LEXSYS: Architecture and Implication for Intelligent Agent systems
Authors:
Charles A. B. Robert
Abstract:
LEXSYS (Legume Expert System) was a project conceived at IITA (International Institute of Tropical Agriculture), Ibadan, Nigeria. It was initiated by COMBS (Collaborative Group on Maize-Based Systems Research) in the 1990s. It was meant to provide a general framework for characterizing on-farm testing for technology design for sustainable cereal-based cropping systems. LEXSYS is not a true expert system, as the name would imply, but simply a user-friendly information system. This work is an attempt to give a formal representation of the existing system and then present areas where intelligent agents can be applied.
Submitted 26 March, 2010;
originally announced March 2010.
-
Characterization and collection of information from heterogeneous multimedia sources with users' parameters for decision support
Authors:
Charles A. B. Robert
Abstract:
No single information source can be good enough to satisfy the divergent and dynamic needs of users all the time. Integrating information from divergent sources can be a solution to deficiencies in information content. We present how information from multimedia documents can be collected by associating a generic database with a federated database. Information collected in this way is made relevant by integrating the parameters of usage and the user's parameters for decision making. We identified seven different classifications of multimedia documents.
Submitted 12 November, 2008;
originally announced November 2008.
-
AMIE: An annotation model for information research
Authors:
Charles A. Robert,
David Amos
Abstract:
The objective of most users in consulting any information database, information warehouse or the internet is to resolve one problem or another. Available online or offline annotation tools were not conceived with the objective of assisting users in their bid to resolve a decisional problem. Apart from the objective and usage of annotation tools, how these tools are conceived and classified has implications for their usage. Several criteria have been used to categorize annotation concepts. Typically, annotations are conceived based on how they affect the organization of the document being considered for annotation or the organization of the resulting annotation. Our approach is annotation that will assist in information research for decision making. The Annotation Model for Information Exchange (AMIE) was conceived with the objective of information sharing and reuse.
Submitted 19 February, 2007;
originally announced February 2007.
-
AMIEDoT: An annotation model for document tracking and recommendation service
Authors:
Charles A. Robert
Abstract:
The primary objective of document annotation in whatever form, manual or electronic, is to allow those who may not have control over the original document to provide a personal view on an information source. Beyond providing personal assessment of original information sources, we are looking at a situation where the annotations made can be used as an additional source of information for document tracking and recommendation services. Most of the annotation tools existing today were conceived for independent use, with no reference to the creator of the annotation. We propose AMIEDoT (Annotation Model for Information Exchange and Document Tracking), an annotation model that can assist in document tracking and recommendation services. The model is based on three parameters in the act of annotation. We believe that introducing document parameters, time, and the parameters of the creator of the annotation into an annotation process can be a dependable way to know who used a document, when a document was used, and what a document was used for. Beyond document tracking, our model can be used not only for selective dissemination of information but also for recommendation services. AMIEDoT can also be used for information sharing and information reuse.
Submitted 19 February, 2007;
originally announced February 2007.