-
Bloom Library: Multimodal Datasets in 300+ Languages for a Variety of Downstream Tasks
Authors:
Colin Leong,
Joshua Nemecek,
Jacob Mansdorfer,
Anna Filighera,
Abraham Owodunni,
Daniel Whitenack
Abstract:
We present Bloom Library, a linguistically diverse set of multimodal and multilingual datasets for language modeling, image captioning, visual storytelling, and speech synthesis/recognition. These datasets represent either the most, or among the most, multilingual datasets for each of the included downstream tasks. In total, the initial release of the Bloom Library datasets covers 363 languages across 32 language families. We train downstream task models for various languages represented in the data, showing the viability of the data for future work in low-resource, multimodal NLP and establishing the first known baselines for these downstream tasks in certain languages (e.g., Bisu [bzi], with an estimated population of 700 users). Some of these first-of-their-kind baselines are comparable to state-of-the-art performance for higher-resourced languages. The Bloom Library datasets are released under Creative Commons licenses on the Hugging Face datasets hub to catalyze more linguistically diverse research in the included downstream tasks.
Submitted 26 October, 2022;
originally announced October 2022.
-
Dyn-ASR: Compact, Multilingual Speech Recognition via Spoken Language and Accent Identification
Authors:
Sangeeta Ghangam,
Daniel Whitenack,
Joshua Nemecek
Abstract:
Running automatic speech recognition (ASR) on edge devices is non-trivial due to resource constraints, especially in scenarios that require supporting multiple languages. We propose a new approach to enable multilingual speech recognition on edge devices. This approach uses both language identification and accent identification to select one of multiple monolingual ASR models on-the-fly, each fine-tuned for a particular accent. Initial results for both recognition performance and resource usage are promising with our approach using less than 1/12th of the memory consumed by other solutions.
Submitted 4 August, 2021;
originally announced August 2021.
-
Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages
Authors:
Wilhelmina Nekoto,
Vukosi Marivate,
Tshinondiwa Matsila,
Timi Fasubaa,
Tajudeen Kolawole,
Taiwo Fagbohungbe,
Solomon Oluwole Akinola,
Shamsuddeen Hassan Muhammad,
Salomon Kabongo,
Salomey Osei,
Sackey Freshia,
Rubungo Andre Niyongabo,
Ricky Macharm,
Perez Ogayo,
Orevaoghene Ahia,
Musie Meressa,
Mofe Adeyemi,
Masabata Mokgesi-Selinga,
Lawrence Okegbemi,
Laura Jane Martinus,
Kolawole Tajudeen,
Kevin Degila,
Kelechi Ogueji,
Kathleen Siminyu,
Julia Kreutzer, et al. (23 additional authors not shown)
Abstract:
Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role for information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released under https://github.com/masakhane-io/masakhane-mt.
Submitted 6 November, 2020; v1 submitted 5 October, 2020;
originally announced October 2020.
-
Masakhane -- Machine Translation For Africa
Authors:
Iroro Orife,
Julia Kreutzer,
Blessing Sibanda,
Daniel Whitenack,
Kathleen Siminyu,
Laura Martinus,
Jamiil Toure Ali,
Jade Abbott,
Vukosi Marivate,
Salomon Kabongo,
Musie Meressa,
Espoir Murhabazi,
Orevaoghene Ahia,
Elan van Biljon,
Arshath Ramkilowan,
Adewale Akinfaderin,
Alp Öktem,
Wole Akin,
Ghollah Kioko,
Kevin Degila,
Herman Kamper,
Bonaventure Dossou,
Chris Emezue,
Kelechi Ogueji,
Abdallah Bashir
Abstract:
Africa has over 2000 languages. Despite this, African languages account for a small portion of available resources and publications in Natural Language Processing (NLP). This is due to multiple factors, including: a lack of focus from government and funding, discoverability, a lack of community, sheer language complexity, difficulty in reproducing papers and no benchmarks to compare techniques. To begin to address the identified problems, MASAKHANE, an open-source, continent-wide, distributed, online research effort for machine translation for African languages, was founded. In this paper, we discuss our methodology for building the community and spurring research from the African continent, as well as outline the success of the community in terms of addressing the identified problems affecting African NLP.
Submitted 13 March, 2020;
originally announced March 2020.
-
Katecheo: A Portable and Modular System for Multi-Topic Question Answering
Authors:
Shirish Hirekodi,
Seban Sunny,
Leonard Topno,
Alwin Daniel,
Daniel Whitenack,
Reuben Skewes,
Stuart Cranney
Abstract:
We introduce a modular system that can be deployed on any Kubernetes cluster for question answering via REST API. This system, called Katecheo, includes three configurable modules that collectively enable identification of questions, classification of those questions into topics, document search, and reading comprehension. We demonstrate the system using publicly available knowledge base articles extracted from Stack Exchange sites. However, users can extend the system to any number of topics, or domains, without the need to modify any of the model serving code or train their own models. All components of the system are open source and available under a permissive Apache 2 License.
Submitted 31 January, 2020; v1 submitted 1 July, 2019;
originally announced July 2019.
-
Stark Ionization of Atoms and Molecules within Density Functional Resonance Theory
Authors:
Ask Hjorth Larsen,
Umberto De Giovannini,
Daniel Lee Whitenack,
Adam Wasserman,
Angel Rubio
Abstract:
We show that the energetics and lifetimes of resonances of finite systems under an external electric field can be captured by Kohn-Sham density functional theory (DFT) within the formalism of uniform complex scaling. Properties of resonances are calculated self-consistently in terms of complex densities, potentials and wavefunctions using adapted versions of the known algorithms from DFT. We illustrate this new formalism by calculating ionization rates using the complex-scaled local density approximation and exact exchange. We consider a variety of atoms (H, He, Li and Be) as well as the hydrogen molecule. Extensions are briefly discussed.
Submitted 11 September, 2013;
originally announced September 2013.
-
Exchange-Correlation Asymptotics and High Harmonic Spectra
Authors:
Michael Mack,
Daniel Whitenack,
Adam Wasserman
Abstract:
This paper has been withdrawn by the authors; the main conclusion is incorrect, as some of the crucial calculations were not properly converged.
Submitted 13 June, 2012; v1 submitted 25 April, 2012;
originally announced April 2012.
-
Effects of magnetic dipole-dipole interactions in atomic Bose-Einstein condensates with tunable s-wave interactions
Authors:
Abraham J. Olson,
Daniel L. Whitenack,
Yong P. Chen
Abstract:
The s-wave interaction is usually the dominant form of interactions in atomic Bose-Einstein condensates (BECs). Recently, Feshbach resonances have been employed to reduce the strength of the s-wave interaction in many atomic species. This opens up the possibility of studying magnetic dipole-dipole interactions (MDDI) in BECs, where the novel physics resulting from long-range and anisotropic dipolar interactions can be explored. Using a variational method, we study the effect of MDDI on the statics and dynamics of atomic BECs with tunable s-wave interactions. We benchmark our calculation against previously observed MDDI effects in $^{52}$Cr with excellent agreement, and predict new effects that should be promising to observe experimentally. A parameter of magnetic Feshbach resonances, $ε_{dd,\text{max}}$, is used to quantitatively indicate the feasibility of experimentally observing MDDI effects in different atomic species. We find that strong MDDI effects should be observable in both in-trap and time-of-flight behaviors for the alkali BECs of $^{7}$Li, $^{39}$K, and $^{133}$Cs. Our results provide a helpful guide for experimentalists to realize and study atomic dipolar quantum gases.
Submitted 23 July, 2013; v1 submitted 6 April, 2012;
originally announced April 2012.
-
Density Functional Resonance Theory: complex density functions, convergence, orbital energies, and functionals
Authors:
Daniel L. Whitenack,
Adam Wasserman
Abstract:
Aspects of Density Functional Resonance Theory (DFRT) [Phys. Rev. Lett. \textbf{107}, 163002 (2011)], a recently developed complex-scaled version of ground-state Density Functional Theory (DFT), are studied in detail. The asymptotic behavior of the complex density function is related to the complex resonance energy and system's threshold energy, and the function's local oscillatory behavior is connected with preferential directions of electron decay. Practical considerations for implementation of the theory are addressed including sensitivity to the complex-scaling parameter, $θ$. In Kohn-Sham DFRT, it is shown that almost all $θ$-dependence in the calculated energies and lifetimes can be extinguished via use of a proper basis set or fine grid. The highest occupied Kohn-Sham orbital energy and lifetime are related to a physical affinity and width, and the threshold energy of the Kohn-Sham system is shown to be equal to the threshold energy of the interacting system shifted by a well-defined functional. Finally, various complex-scaling conditions are derived which relate the functionals of ground-state DFT to those of DFRT via proper scaling factors and a non-Hermitian coupling constant system.
Submitted 23 February, 2012;
originally announced February 2012.
-
Density Functional Theory for Fractional Particle Number: Derivative Discontinuity of the Energy at the Maximum Number of Bound Electrons
Authors:
Daniel L. Whitenack,
Yu Zhang,
Adam Wasserman
Abstract:
The derivative discontinuity in the exact exchange-correlation potential of ensemble Density Functional Theory (DFT) is investigated at the specific integer number that corresponds to the maximum number of bound electrons, $J_{max}$. A recently developed complex-scaled analog of DFT is extended to fractional particle numbers and used to study ensembles of both bound and metastable states. It is found that the exact exchange-correlation potential experiences discontinuous jumps at integer particle numbers including $J_{max}$. For integers below $J_{max}$ the jump is purely real because of the real shift in the chemical potential. At $J_{max}$, the jump has a non-zero imaginary component reflecting the finite lifetime of the $(J_{max}+1)$ state.
Submitted 8 November, 2011;
originally announced November 2011.
-
Density Functional Resonance Theory of Unbound Electronic Systems
Authors:
Daniel L. Whitenack,
Adam Wasserman
Abstract:
Density Functional Resonance Theory (DFRT) is a complex-scaled version of ground-state Density Functional Theory (DFT) that allows one to calculate the resonance energies and lifetimes of metastable anions. In this formalism, the exact energy and lifetime of the lowest-energy resonance of unbound systems are encoded into a complex "density" that can be obtained via complex-coordinate scaling. This complex density is used as the primary variable in a DFRT calculation just as the ground-state density would be used as the primary variable in DFT. As in DFT, there exists a mapping of the N-electron interacting system to a Kohn-Sham system of N non-interacting particles in DFRT. This mapping facilitates self-consistent calculations with an initial guess for the complex density, as illustrated with an exactly-solvable model system. Whereas DFRT yields in principle the exact resonance energy and lifetime of the interacting system, we find that neglecting the complex-correlation contribution leads to errors of similar magnitude to those of standard scattering close-coupling calculations under the bound-state approximation.
Submitted 20 June, 2011;
originally announced June 2011.
-
Resonance Lifetimes from Complex Densities
Authors:
Daniel L. Whitenack,
Adam Wasserman
Abstract:
The ab-initio calculation of resonance lifetimes of metastable anions challenges modern quantum-chemical methods. The exact lifetime of the lowest-energy resonance is encoded into a complex "density" that can be obtained via complex-coordinate scaling. We illustrate this with one-electron examples and show how the lifetime can be extracted from the complex density in much the same way as the ground-state energy of bound systems is extracted from its ground-state density.
Submitted 23 October, 2009;
originally announced October 2009.