-
Antibody Representation Learning for Drug Discovery
Authors:
Lin Li,
Esther Gupta,
John Spaeth,
Leslie Shing,
Tristan Bepler,
Rajmonda Sulo Caceres
Abstract:
Therapeutic antibody development has become an increasingly popular approach for drug development. To date, antibody therapeutics are largely developed using large scale experimental screens of antibody libraries containing hundreds of millions of antibody sequences. The high cost and difficulty of develo** therapeutic antibodies create a pressing need for computational methods to predict antibo…
▽ More
Therapeutic antibody development has become an increasingly popular approach for drug development. To date, antibody therapeutics are largely developed using large scale experimental screens of antibody libraries containing hundreds of millions of antibody sequences. The high cost and difficulty of develo** therapeutic antibodies create a pressing need for computational methods to predict antibody properties and create bespoke designs. However, the relationship between antibody sequence and activity is a complex physical process and traditional iterative design approaches rely on large scale assays and random mutagenesis. Deep learning methods have emerged as a promising way to learn antibody property predictors, but predicting antibody properties and target-specific activities depends critically on the choice of antibody representations and data linking sequences to properties is often limited. Existing works have not yet investigated the value, limitations and opportunities of these methods in application to antibody-based drug discovery. In this paper, we present results on a novel SARS-CoV-2 antibody binding dataset and an additional benchmark dataset. We compare three classes of models: conventional statistical sequence models, supervised learning on each dataset independently, and fine-tuning an antibody specific pre-trained language model. Experimental results suggest that self-supervised pretraining of feature representation consistently offers significant improvement in over previous approaches. We also investigate the impact of data size on the model performance, and discuss challenges and opportunities that the machine learning community can address to advance in silico engineering and design of therapeutic antibodies.
△ Less
Submitted 5 October, 2022;
originally announced October 2022.
-
Fluently specifying taint-flow queries with fluentTQL
Authors:
Goran Piskachev,
Johannes Späth,
Ingo Budde,
Eric Bodden
Abstract:
Previous work has shown that taint analyses are only useful if correctly customized to the context in which they are used. Existing domain-specific languages (DSLs) allow such customization through the definition of deny-listing data-flow rules that describe potentially vulnerable taint-flows. These languages, however, are designed primarily for security experts who are knowledgeable in taint anal…
▽ More
Previous work has shown that taint analyses are only useful if correctly customized to the context in which they are used. Existing domain-specific languages (DSLs) allow such customization through the definition of deny-listing data-flow rules that describe potentially vulnerable taint-flows. These languages, however, are designed primarily for security experts who are knowledgeable in taint analysis. Software developers consider these languages to be complex. This paper presents fluentTQL, a query language particularly for taint-flow. fluentTQL is internal Java DSL and uses a fluent-interface design. fluentTQL queries can express various taint-style vulnerability types, e.g. injections, cross-site scripting or path traversal. This paper describes fluentTQL's abstract and concrete syntax and defines its runtime semantics. The semantics are independent of any underlying analysis and allows evaluation of fluentTQL queries by a variety of taint analyses. Instantiations of fluentTQL, on top of two taint analysis solvers, Boomerang and FlowDroid, show and validate fluentTQL expressiveness. Based on existing examples from the literature, we implemented queries for 11 popular security vulnerability types in Java. Using our SQL injection specification, the Boomerang-based taint analysis found all 17 known taint-flows in the OWASP WebGoat application, whereas with FlowDroid 13 taint-flows were found. Similarly, in a vulnerable version of the Java PetClinic application, the Boomerang-based taint analysis found all seven expected taint-flows. In seven real-world Android apps with 25 expected taint-flows, 18 were detected. In a user study with 26 software developers, fluentTQL reached a high usability score. In comparison to CodeQL, the state-of-the-art DSL by Semmle/GitHub, participants found fluentTQL more usable and with it they were able to specify taint analysis queries in shorter time.
△ Less
Submitted 6 April, 2022;
originally announced April 2022.
-
The FeatureCloud AI Store for Federated Learning in Biomedicine and Beyond
Authors:
Julian Matschinske,
Julian Späth,
Reza Nasirigerdeh,
Reihaneh Torkzadehmahani,
Anne Hartebrodt,
Balázs Orbán,
Sándor Fejér,
Olga Zolotareva,
Mohammad Bakhtiari,
Béla Bihari,
Marcus Bloice,
Nina C Donner,
Walid Fdhila,
Tobias Frisch,
Anne-Christin Hauschild,
Dominik Heider,
Andreas Holzinger,
Walter Hötzendorfer,
Jan Hospes,
Tim Kacprowski,
Markus Kastelitz,
Markus List,
Rudolf Mayer,
Mónika Moga,
Heimo Müller
, et al. (7 additional authors not shown)
Abstract:
Machine Learning (ML) and Artificial Intelligence (AI) have shown promising results in many areas and are driven by the increasing amount of available data. However, this data is often distributed across different institutions and cannot be shared due to privacy concerns. Privacy-preserving methods, such as Federated Learning (FL), allow for training ML models without sharing sensitive data, but t…
▽ More
Machine Learning (ML) and Artificial Intelligence (AI) have shown promising results in many areas and are driven by the increasing amount of available data. However, this data is often distributed across different institutions and cannot be shared due to privacy concerns. Privacy-preserving methods, such as Federated Learning (FL), allow for training ML models without sharing sensitive data, but their implementation is time-consuming and requires advanced programming skills. Here, we present the FeatureCloud AI Store for FL as an all-in-one platform for biomedical research and other applications. It removes large parts of this complexity for developers and end-users by providing an extensible AI Store with a collection of ready-to-use apps. We show that the federated apps produce similar results to centralized ML, scale well for a typical number of collaborators and can be combined with Secure Multiparty Computation (SMPC), thereby making FL algorithms safely and easily applicable in biomedical and clinical environments.
△ Less
Submitted 12 May, 2021;
originally announced May 2021.
-
Privacy-preserving Artificial Intelligence Techniques in Biomedicine
Authors:
Reihaneh Torkzadehmahani,
Reza Nasirigerdeh,
David B. Blumenthal,
Tim Kacprowski,
Markus List,
Julian Matschinske,
Julian Späth,
Nina Kerstin Wenke,
Béla Bihari,
Tobias Frisch,
Anne Hartebrodt,
Anne-Christin Hausschild,
Dominik Heider,
Andreas Holzinger,
Walter Hötzendorfer,
Markus Kastelitz,
Rudolf Mayer,
Cristian Nogales,
Anastasia Pustozerova,
Richard Röttger,
Harald H. H. W. Schmidt,
Ameli Schwalber,
Christof Tschohl,
Andrea Wohner,
Jan Baumbach
Abstract:
Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g. in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary s…
▽ More
Artificial intelligence (AI) has been successfully applied in numerous scientific domains. In biomedicine, AI has already shown tremendous potential, e.g. in the interpretation of next-generation sequencing data and in the design of clinical decision support systems. However, training an AI model on sensitive data raises concerns about the privacy of individual participants. For example, summary statistics of a genome-wide association study can be used to determine the presence or absence of an individual in a given dataset. This considerable privacy risk has led to restrictions in accessing genomic and other biomedical data, which is detrimental for collaborative research and impedes scientific progress. Hence, there has been a substantial effort to develop AI methods that can learn from sensitive data while protecting individuals' privacy. This paper provides a structured overview of recent advances in privacy-preserving AI techniques in biomedicine. It places the most important state-of-the-art approaches within a unified taxonomy and discusses their strengths, limitations, and open problems. As the most promising direction, we suggest combining federated machine learning as a more scalable approach with other additional privacy preserving techniques. This would allow to merge the advantages to provide privacy guarantees in a distributed way for biomedical applications. Nonetheless, more research is necessary as hybrid approaches pose new challenges such as additional network or computation overhead.
△ Less
Submitted 6 November, 2020; v1 submitted 22 July, 2020;
originally announced July 2020.
-
CrySL: Validating Correct Usage of Cryptographic APIs
Authors:
Stefan Krüger,
Johannes Späth,
Karim Ali,
Eric Bodden,
Mira Mezini
Abstract:
Various studies have empirically shown that the majority of Java and Android apps misuse cryptographic libraries, causing devastating breaches of data security. Therefore, it is crucial to detect such misuses early in the development process. The fact that insecure usages are not the exception but the norm precludes approaches based on property inference and anomaly detection.
In this paper, we…
▽ More
Various studies have empirically shown that the majority of Java and Android apps misuse cryptographic libraries, causing devastating breaches of data security. Therefore, it is crucial to detect such misuses early in the development process. The fact that insecure usages are not the exception but the norm precludes approaches based on property inference and anomaly detection.
In this paper, we present CrySL, a definition language that enables cryptography experts to specify the secure usage of the cryptographic libraries that they provide. CrySL combines the generic concepts of method-call sequences and data-flow constraints with domain-specific constraints related to cryptographic algorithms and their parameters. We have implemented a compiler that translates a CrySL ruleset into a context- and flow-sensitive demand-driven static analysis. The analysis automatically checks a given Java or Android app for violations of the CrySL-encoded rules.
We empirically evaluated our ruleset through analyzing 10,001 Android apps. Our results show that misuse of cryptographic APIs is still widespread, with 96% of apps containing at least one misuse. However, we observed fewer of the misuses that were reported in previous work.
△ Less
Submitted 2 October, 2017;
originally announced October 2017.