
Artificial Intelligence for the Electron Ion Collider (AI4EIC)

Authors: C. Allaire, R. Ammendola, E. -C. Aschenauer, M. Balandat, M. Battaglieri, J. Bernauer, M. Bondì, N. Branson, T. Britton, A. Butter, I. Chahrour, P. Chatagnon, E. Cisbani, E. W. Cline, S. Dash, C. Dean, W. Deconinck, A. Deshpande, M. Diefenthaler, R. Ent, C. Fanelli, M. Finger, M. Finger, Jr., E. Fol, S. Furletov, et al. (70 additional authors not shown)

Abstract: The Electron-Ion Collider (EIC), a state-of-the-art facility for studying the strong force, is expected to begin commissioning its first experiments in 2028. This is an opportune time for artificial intelligence (AI) to be included from the start at this facility and in all phases that lead up to the experiments. The second annual workshop organized by the AI4EIC working group, which recently took place, centered on exploring all current and prospective application areas of AI for the EIC. This workshop is not only beneficial for the EIC, but also provides valuable insights for the newly established ePIC collaboration at EIC. This paper summarizes the different activities and R&D projects covered across the sessions of the workshop and provides an overview of the goals, approaches and strategies regarding AI/ML in the EIC community, as well as cutting-edge techniques currently studied in other experiments.

Submitted 17 July, 2023; originally announced July 2023.

Comments: 27 pages, 11 figures, AI4EIC workshop, tutorials and hackathon

arXiv:2210.09805 [pdf, other]

Domain Specific Sub-network for Multi-Domain Neural Machine Translation

Authors: Amr Hendy, Mohamed Abdelghaffar, Mohamed Afify, Ahmed Y. Tawfik

Abstract: This paper presents Domain-Specific Sub-network (DoSS). It uses a set of masks obtained through pruning to define a sub-network for each domain and finetunes the sub-network parameters on domain data. This performs very closely and drastically reduces the number of parameters compared to finetuning the whole network on each domain. Also a method to make masks unique per domain is proposed and shown to greatly improve the generalization to unseen domains. In our experiments on German to English machine translation the proposed method outperforms the strong baseline of continue training on multi-domain (medical, tech and religion) data by 1.47 BLEU points. Also continue training DoSS on new domain (legal) outperforms the multi-domain (medical, tech, religion, legal) baseline by 1.52 BLEU points.

Submitted 18 October, 2022; originally announced October 2022.

Comments: 6 pages, 1 figure, 5 tables, AACL-IJCNLP 2022 conference

arXiv:2203.03392 [pdf]

Naturally-meaningful and efficient descriptors: machine learning of material properties based on robust one-shot ab initio descriptors

Authors: Sherif Abdulkader Tawfik, Salvy P. Russo

Abstract: Establishing a data-driven pipeline for the discovery of novel materials requires the engineering of material features that can be feasibly calculated and can be applied to predict a material's target properties. Here we propose a new class of descriptors for describing crystal structures, which we term Robust One-Shot Ab initio (ROSA) descriptors. ROSA is computationally cheap and is shown to accurately predict a range of material properties. These simple and intuitive class of descriptors are generated from the energetics of a material at a low level of theory using an incomplete ab initio calculation. We demonstrate how the incorporation of ROSA descriptors in ML-based property prediction leads to accurate predictions over a wide range of crystals, amorphized crystals, metal-organic frameworks and molecules. We believe that the low computational cost and ease of use of these descriptors will significantly improve ML-based predictions.

Submitted 2 October, 2022; v1 submitted 2 March, 2022; originally announced March 2022.

Comments: 13 pages, accepted in Journal of Cheminformatics

arXiv:2111.13284 [pdf, other]

Ensembling of Distilled Models from Multi-task Teachers for Constrained Resource Language Pairs

Authors: Amr Hendy, Esraa A. Gad, Mohamed Abdelghaffar, Jailan S. ElMosalami, Mohamed Afify, Ahmed Y. Tawfik, Hany Hassan Awadalla

Abstract: This paper describes our submission to the constrained track of WMT21 shared news translation task. We focus on the three relatively low resource language pairs Bengali to and from Hindi, English to and from Hausa, and Xhosa to and from Zulu. To overcome the limitation of relatively low parallel data we train a multilingual model using a multitask objective employing both parallel and monolingual data. In addition, we augment the data using back translation. We also train a bilingual model incorporating back translation and knowledge distillation then combine the two models using sequence-to-sequence mapping. We see around 70% relative gain in BLEU point for English to and from Hausa, and around 25% relative improvements for both Bengali to and from Hindi, and Xhosa to and from Zulu compared to bilingual baselines.

Submitted 25 November, 2021; originally announced November 2021.

arXiv:2011.07933 [pdf, other]

Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions

Authors: Muhammad N. ElNokrashy, Amr Hendy, Mohamed Abdelghaffar, Mohamed Afify, Ahmed Tawfik, Hany Hassan Awadalla

Abstract: This paper describes our submission to the WMT20 sentence filtering task. We combine scores from (1) a custom LASER built for each source language, (2) a classifier built to distinguish positive and negative pairs by semantic alignment, and (3) the original scores included in the task devkit. For the mBART finetuning setup, provided by the organizers, our method shows 7% and 5% relative improvement over baseline, in sacreBLEU score on the test set for Pashto and Khmer respectively.

Submitted 16 November, 2020; originally announced November 2020.

Comments: Accepted at WMT20 (EMNLP 2020 Fifth Conference on Machine Translation)

arXiv:1802.09287 [pdf]

Gender Aware Spoken Language Translation Applied to English-Arabic

Authors: Mostafa Elaraby, Ahmed Y. Tawfik, Mahmoud Khaled, Hany Hassan, Aly Osama

Abstract: Spoken Language Translation (SLT) is becoming more widely used and becoming a communication tool that helps in crossing language barriers. One of the challenges of SLT is the translation from a language without gender agreement to a language with gender agreement such as English to Arabic. In this paper, we introduce an approach to tackle such limitation by enabling a Neural Machine Translation system to produce gender-aware translation. We show that NMT system can model the speaker/listener gender information to produce gender-aware translation. We propose a method to generate data used in adapting a NMT system to produce gender-aware. The proposed approach can achieve significant improvement of the translation quality by 2 BLEU points.

Submitted 26 February, 2018; originally announced February 2018.

Comments: Proceedings of the Second International Conference on Natural Language and Speech Processing, 2018 IEEE

arXiv:1707.00079 [pdf, other]

Synthetic Data for Neural Machine Translation of Spoken-Dialects

Authors: Hany Hassan, Mostafa Elaraby, Ahmed Tawfik

Abstract: In this paper, we introduce a novel approach to generate synthetic data for training Neural Machine Translation systems. The proposed approach transforms a given parallel corpus between a written language and a target language to a parallel corpus between a spoken dialect variant and the target language. Our approach is language independent and can be used to generate data for any variant of the source language such as slang or spoken dialect or even for a different language that is closely related to the source language. The proposed approach is based on local embedding projection of distributed representations which utilizes monolingual embeddings to transform parallel data across language variants. We report experimental results on Levantine to English translation using Neural Machine Translation. We show that the generated data can improve a very large scale system by more than 2.8 Bleu points using synthetic spoken data which shows that it can be used to provide a reliable translation system for a spoken dialect that does not have sufficient parallel data.

Submitted 28 November, 2017; v1 submitted 30 June, 2017; originally announced July 2017.

arXiv:1306.0024 [pdf, ps, other]

doi 10.1139/cjp-2014-0412

Calibrated Fair Measures of Measure: Indices to Quantify an Individual's Scientific Research Output

Authors: A. Tawfik

Abstract: Are existing ways of measuring scientific quality reflecting disadvantages of not being part of giant collaborations? How could possible discrimination be avoided? We propose indices defined for each discipline (subfield) and which count the plausible contributions added up by collaborators maintaining the spirit of interdependency. Based on the growing debate about defining potential biases and detecting unethical behavior, a standardized method to measure contributions of the astronomical number of coauthors is introduced.

Submitted 31 May, 2013; originally announced June 2013.

Comments: 10 pages, 2 figures with 4 eps graphs

Report number: ECTP-2013-02, WLCAPP-2013-02

Journal ref: Canadian J. Phys. 93, 745-749 (2015)

Showing 1–8 of 8 results for author: Tawfik, A