
SELFormer: Molecular Representation Learning via SELFIES Language Models

Authors: Atakan Yüksel, Erva Ulusoy, Atabey Ünlü, Tunca Doğan

Abstract: Automated computational analysis of the vast chemical space is critical for numerous fields of research such as drug discovery and material science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data. One approach to efficiently learn molecular representations is processing string-based notations of chemicals via natural language processing (NLP) algorithms. The majority of the methods proposed so far utilize SMILES notations for this purpose; however, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based chemical language models, on predicting aqueous solubility of molecules and adverse drug reactions. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We shared SELFormer as a programmatic tool, together with its datasets and pre-trained models. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.

Submitted 25 May, 2023; v1 submitted 10 April, 2023; originally announced April 2023.

Comments: 22 pages, 4 figures, 8 tables

MSC Class: 68T07

ACM Class: I.2.1; I.2.6; I.5.4

arXiv:2212.13543

Democratising Knowledge Representation with BioCypher

Authors: Sebastian Lobentanzer, Patrick Aloy, Jan Baumbach, Balazs Bohar, Pornpimol Charoentong, Katharina Danhauser, Tunca Doğan, Johann Dreo, Ian Dunham, Adrià Fernandez-Torras, Benjamin M. Gyori, Michael Hartung, Charles Tapley Hoyt, Christoph Klein, Tamas Korcsmaros, Andreas Maier, Matthias Mann, David Ochoa, Elena Pareja-Lorente, Ferdinand Popp, Martin Preusse, Niklas Probul, Benno Schwikowski, Bünyamin Sen, Maximilian T. Strauss, et al. (4 additional authors not shown)

Abstract: Standardising the representation of biomedical knowledge among all researchers is an insurmountable task, hindering the effectiveness of many computational methods. To facilitate harmonisation and interoperability despite this fundamental challenge, we propose to standardise the framework of knowledge graph creation instead. We implement this standardisation in BioCypher, a FAIR (findable, accessible, interoperable, reusable) framework to transparently build biomedical knowledge graphs while preserving provenances of the source data. Mapping the knowledge onto biomedical ontologies helps to balance the needs for harmonisation, human and machine readability, and ease of use and accessibility to non-specialist researchers. We demonstrate the usefulness of this framework on a variety of use cases, from maintenance of task-specific knowledge stores, to interoperability between biomedical domains, to on-demand building of task-specific knowledge graphs for federated learning. BioCypher (https://biocypher.org) frees up valuable developer time; we encourage further development and usage by the community.

Submitted 17 January, 2023; v1 submitted 27 December, 2022; originally announced December 2022.

Comments: 34 pages, 6 figures; submitted to Nature Biotechnology

arXiv:2002.05922

doi: 10.1109/TIP.2020.2972112

Realizing a Low-Power Head-Mounted Phase-Only Holographic Display by Light-Weight Compression

Authors: Burak Soner, Erdem Ulusoy, A. Murat Tekalp, Hakan Urey

Abstract: Head-mounted holographic displays (HMHD) are projected to be the first commercial realization of holographic video display systems. HMHDs use liquid crystal on silicon (LCoS) spatial light modulators (SLM), which are best suited to display phase-only holograms (POH). The performance/watt requirement of a monochrome, 60 fps Full HD, 2-eye, POH HMHD system is about 10 TFLOPS/W, which is orders of magnitude higher than what is achievable by commercially available mobile processors. To mitigate this compute power constraint, display-ready POHs shall be generated on a nearby server and sent to the HMHD in compressed form over a wireless link. This paper discusses the design of a feasible HMHD-based augmented reality system, focusing on compression requirements and per-pixel rate-distortion trade-off for transmission of display-ready POH from the server to the HMHD. Since the decoder in the HMHD needs to operate on low power, only coding methods that have a low-power decoder implementation are considered. Effects of 2D phase unwrapping and flat quantization on compression performance are also reported. We next propose a versatile PCM-POH codec with progressive quantization that can adapt to SLM-dynamic-range and available bitrate, and features per-pixel rate-distortion control to achieve acceptable POH quality at target rates of 60-200 Mbit/s that can be reliably achieved by current wireless technologies. Our results demonstrate the feasibility of realizing a low-power, quality-ensured, multi-user, interactive HMHD augmented reality system with commercially available components using the proposed adaptive compression of display-ready POH with light-weight decoding.

Submitted 14 February, 2020; originally announced February 2020.

Comments: 10 pages, 6 figures, accepted for publication in the IEEE Transactions on Image Processing

Journal ref: IEEE Transactions on Image Processing, vol. 29, pp. 4505-4515, 2020

Showing 1–3 of 3 results for author: Ulusoy, E