-
A New Massive Multilingual Dataset for High-Performance Language Technologies
Authors:
Ona de Gibert,
Graeme Nail,
Nikolay Arefyev,
Marta Bañón,
Jelmer van der Linde,
Shaoxiong Ji,
Jaume Zaragoza-Bernabeu,
Mikko Aulamo,
Gema Ramírez-Sánchez,
Andrey Kutuzov,
Sampo Pyysalo,
Stephan Oepen,
Jörg Tiedemann
Abstract:
We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ~5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.
Submitted 20 March, 2024;
originally announced March 2024.
-
OpusCleaner and OpusTrainer, open source toolkits for training Machine Translation and Large language models
Authors:
Nikolay Bogoychev,
Jelmer van der Linde,
Graeme Nail,
Barry Haddow,
Jaume Zaragoza-Bernabeu,
Gema Ramírez-Sánchez,
Lukas Weymann,
Tudor Nicolae Mateiu,
Jindřich Helcl,
Mikko Aulamo
Abstract:
Developing high quality machine translation systems is a labour intensive, challenging and confusing process for newcomers to the field. We present a pair of tools OpusCleaner and OpusTrainer that aim to simplify the process, reduce the amount of work and lower the entry barrier for newcomers.
OpusCleaner is a data downloading, cleaning, and preprocessing toolkit. It is designed to allow researchers to quickly download, visualise and preprocess bilingual (or monolingual) data that comes from many different sources, each of them with different quality, issues, and unique filtering/preprocessing requirements.
OpusTrainer is a data scheduling and data augmenting tool aimed at building large scale, robust machine translation systems and large language models. It features deterministic data mixing from many different sources, on-the-fly data augmentation and more.
Using these tools, we showcase how they can be used to create high quality machine translation models that are robust to noisy user input, as well as multilingual models and terminology-aware models.
Submitted 24 November, 2023;
originally announced November 2023.
-
TranslateLocally: Blazing-fast translation running on the local CPU
Authors:
Nikolay Bogoychev,
Jelmer van der Linde,
Kenneth Heafield
Abstract:
Every day, millions of people sacrifice their privacy and browsing habits in exchange for online machine translation. Companies and governments with confidentiality requirements often ban online translation or pay a premium to disable logging. To bring control back to the end user and demonstrate speed, we developed translateLocally. Running locally on a desktop or laptop CPU, translateLocally delivers cloud-like translation speed and quality even on 10-year-old hardware. The open-source software is based on Marian and runs on Linux, Windows, and macOS.
Submitted 21 September, 2021;
originally announced September 2021.
-
LUCE: A Blockchain Solution for monitoring data License accoUntability and CompliancE
Authors:
Andine Havelange,
Michel Dumontier,
Birgit Wouters,
Jona Linde,
David Townend,
Arno Riedl,
Visara Urovi
Abstract:
In this paper we present our preliminary work on monitoring data License accoUntability and CompliancE (LUCE). LUCE is a blockchain platform solution designed to stimulate data sharing and reuse by facilitating compliance with licensing terms. The platform enables data accountability by recording the use of data and their purpose on a blockchain-supported platform. LUCE allows for individual data to be rectified and erased. In doing so, LUCE can ensure data subjects' rights to access, rectification and erasure under the General Data Protection Regulation (GDPR). Our contribution is to provide a distributed solution for the automatic management of data accountability and their license terms.
Submitted 6 August, 2019;
originally announced August 2019.
-
Matrices commuting with a given normal tropical matrix
Authors:
J. Linde,
M. J. de la Puente
Abstract:
Consider the space $M_n^{nor}$ of square normal matrices $X=(x_{ij})$ over $\mathbb{R}\cup\{-\infty\}$, i.e., $-\infty\le x_{ij}\le0$ and $x_{ii}=0$. Endow $M_n^{nor}$ with the tropical sum $\oplus$ and multiplication $\odot$. Fix a real matrix $A\in M_n^{nor}$ and consider the set $Ω(A)$ of matrices in $M_n^{nor}$ which commute with $A$. We prove that $Ω(A)$ is a finite union of alcoved polytopes; in particular, $Ω(A)$ is a finite union of convex sets. The set $Ω^A(A)$ of $X$ such that $A\odot X=X\odot A=A$ is also a finite union of alcoved polytopes. The same is true for the set $Ω'(A)$ of $X$ such that $A\odot X=X\odot A=X$.
A topology is given to $M_n^{nor}$. Then, the set $Ω^{A}(A)$ is a neighborhood of the identity matrix $I$. If $A$ is strictly normal, then $Ω'(A)$ is a neighborhood of the zero matrix. In one case, $Ω(A)$ is a neighborhood of $A$. We give an upper bound for the dimension of $Ω'(A)$. We explore the relationship between the polyhedral complexes $span A$, $span X$ and $span (AX)$, when $A$ and $X$ commute. Two matrices, denoted $\underline{A}$ and $\bar{A}$, arise from $A$, in connection with $Ω(A)$. The geometric meaning of them is given in detail, for one example. We produce examples of matrices which commute, in any dimension.
Submitted 3 December, 2014; v1 submitted 4 September, 2012;
originally announced September 2012.
-
Towards Vertex Algebras of Krichever-Novikov Type, Part I
Authors:
K. J. Linde
Abstract:
It is shown that a certain representation of the Heisenberg type Krichever-Novikov algebra gives rise to a state field correspondence that is quite similar to the vertex algebra structure of the usual Heisenberg algebra. Finally a definition of Krichever-Novikov type vertex algebras is proposed and its relation to vertex algebras is discussed.
Submitted 29 May, 2003;
originally announced May 2003.
-
Dust in the Local Interstellar Wind
Authors:
P. C. Frisch,
J. M. Dorschner,
J. Geiss,
J. M. Greenberg,
E. Grün,
M. Landgraf,
P. Hoppe,
A. P. Jones,
W. Krätschmer,
T. J. Linde,
G. E. Morfill,
W. T. Reach,
J. D. Slavin,
J. Svestka,
A. N. Witt,
G. P. Zank
Abstract:
The gas-to-dust mass ratios found for interstellar dust within the Solar System, versus values determined astronomically for the cloud around the Solar System, suggest that large and small interstellar grains have separate histories, and that large interstellar grains preferentially detected by spacecraft are not formed exclusively by mass exchange with nearby interstellar gas. Observations by the Ulysses and Galileo satellites of the mass spectrum and flux rate of interstellar dust within the heliosphere are combined with information about the density, composition, and relative flow speed and direction of interstellar gas in the cloud surrounding the solar system to derive an in situ value for the gas-to-dust mass ratio, $R_{g/d} = 94^{+46}_{-38}$. Hubble observations of the cloud surrounding the solar system yield a gas-to-dust mass ratio of $R_{g/d} = 551^{+61}_{-251}$ when B-star reference abundances are assumed. The exclusion of small dust grains from the heliosheath and heliosphere regions is modeled, increasing the discrepancy between interstellar and in situ observations. The shock destruction of interstellar grains is considered, and comparisons are made with interplanetary and presolar dust grains.
Submitted 10 May, 1999;
originally announced May 1999.
-
Is there a mass dependence in the spin structure of baryons?
Authors:
Johan Linde,
Hakan Snellman
Abstract:
We analyze the axial-vector form factors of the nucleon-hyperon system in a model with SU(3)$_{\text{flavor}}$ symmetry breaking due to mass dependent quark spin polarizations. This mass dependence is deduced from an analysis of magnetic moment data, and implies that the spin contributions from the quarks to a baryon decrease with the mass of the baryon. When applied to the axial-vector form factors, these mass dependent spin polarizations bring the various sum-rules from the quark model in better agreement with experimental data. As a consequence our analysis leads to a reduced value for the total spin polarization of the proton.
Submitted 26 May, 1998;
originally announced May 1998.
-
Decuplet Baryon Magnetic Moments in the Chiral Quark Model
Authors:
Johan Linde,
Tommy Ohlsson,
Hakan Snellman
Abstract:
We present calculations of the decuplet baryon magnetic moments in the chiral quark model. As input we use parameters obtained in qualitatively accurate fits to the octet baryon magnetic moments studied previously. The values found for the magnetic moments of $Δ^{++}$ and $Ω^{-}$ are in good agreement with experiments. We finally calculate the total quark spin polarizations of the decuplet baryons and find that they are considerably smaller than what is expected from the non-relativistic quark model.
Submitted 15 April, 1998; v1 submitted 26 September, 1997;
originally announced September 1997.
-
Octet Baryon Magnetic Moments in the Chiral Quark Model with Configuration Mixing
Authors:
Johan Linde,
Tommy Ohlsson,
Hakan Snellman
Abstract:
The Coleman-Glashow sum-rule for magnetic moments is always fulfilled in the chiral quark model, independently of SU(3) symmetry breaking. This is due to the structure of the wave functions, coming from the non-relativistic quark model. Experimentally, the Coleman-Glashow sum-rule is violated by about ten standard deviations. To overcome this problem, two models of wave functions with configuration mixing are studied. One of these models violates the Coleman-Glashow sum-rule to the right degree and also reproduces the octet baryon magnetic moments rather accurately.
Submitted 30 December, 1997; v1 submitted 15 September, 1997;
originally announced September 1997.
-
Charmonium in the instantaneous approximation
Authors:
Johan Linde,
Hakan Snellman
Abstract:
The charmonium system is studied in a Salpeter model with a vector plus scalar potential. We use a kinematical formalism based on the one developed by Suttorp, and present general eigenvalue equations and expressions for decay observables in an onium system for such a potential both in the Feynman and Coulomb gauges. Special attention is paid to the problem with renormalization of the lepton pair decays, and we argue that they must be defined relative to one of the experimental decay widths because renormalization of the vertex function is not possible. The parameters of the model are determined by a fit to the mass spectrum and the lepton pair decay rates. Two gamma decays and E1 and M1 transitions are then calculated and found to be well accounted for. No significant differences in the results in Feynman or Coulomb gauge are found. A comparison is made, regarding the electromagnetic transitions, between the full and reduced Salpeter equation. A large difference is found showing the importance of using the full Salpeter equation.
Submitted 21 March, 1997;
originally announced March 1997.
-
Evidence for mass dependent effects in the spin structure of baryons
Authors:
Johan Linde,
Håkan Snellman
Abstract:
We analyze the axial-vector form factors of the nucleon-hyperon system in a model with mass dependent quark spin polarizations. This mass dependence is deduced from an earlier analysis of magnetic moment data, and implies that the spin contributions from the quarks to a baryon decrease with the mass of the baryon. When applied to the axial-vector form factors, these mass dependent spin polarizations bring the various sum-rules from the model in better agreement with experimental data. Our analysis leads to a reduced value for the total spin polarization of the proton.
Submitted 21 December, 1995;
originally announced December 1995.
-
Magnetic moments of the 3/2 resonances and their quark spin structure
Authors:
Johan Linde,
Håkan Snellman
Abstract:
We discuss magnetic moments of the $J=3/2$ baryons based on an earlier model for the baryon magnetic moments, allowing for flavor symmetry breaking in the quark magnetic moments as well as a general quark spin structure. From our earlier analysis of the nucleon-hyperon magnetic moments and the measured values of the magnetic moments of $Δ^{++}$ and $Ω^{-}$ we predict the other magnetic moments and deduce the spin structure of the resonance particles. We find from experiment that the total spin polarization of the decuplet baryons, $ΔΣ(3/2)$, is considerably smaller than the non-relativistic quark model value of 3, although the data is still not good enough to give a precise determination.
Submitted 22 November, 1995; v1 submitted 24 October, 1995;
originally announced October 1995.
-
The Hamburg Quasar Monitoring Program (HQM) at Calar Alto. III. Lightcurves of optically violent variable sources
Authors:
K. -J. Schramm,
U. Borgeest,
D. Kühl,
J. von Linde,
M. D. Linnert,
T. Schramm
Abstract:
HQM is an optical broad-band photometric monitoring program carried out since September 1988. We use a CCD camera at the MPIA 1.2 m telescope. Fully automatic photometric reduction relative to stars in the frames is done within a few minutes after each exposure, so interesting brightness changes can be followed in detail. The typical photometric error is 1-2% for a 17.5 mag quasar. We here present lightcurves of 14 known violently variable sources and compare them with literature data. For two BL Lac objects, 1E 1229+645 and 4C 56.27 (1823+568), this paper is the first variability study. We have also carried out POSS photometry to obtain indications for variability on a longer timescale.
Submitted 23 March, 1994;
originally announced March 1994.
-
The Hamburg Quasar Monitoring Program (HQM) at Calar Alto. II. Lightcurves of weakly variable objects
Authors:
K. -J. Schramm,
U. Borgeest,
D. Kühl,
J. v. Linde,
M. D. Linnert
Abstract:
HQM is an optical broad-band photometric monitoring program carried out since September 1988. We use a CCD camera attached to the MPIA 1.2 m telescope. Fully automatic photometric reduction relative to stars in the frames is done within a few minutes after each exposure, so interesting brightness changes can be followed in detail. The typical photometric error is 1-2% for a 17.5 mag quasar. We here present lightcurves already evaluated but not shown in Paper I. We also discuss existing literature data.
Submitted 23 March, 1994;
originally announced March 1994.
-
Flavor symmetry breaking in quark magnetic moments
Authors:
Johan Linde,
Håkan Snellman
Abstract:
We discuss the magnetic moments of the baryons allowing for flavor symmetry breaking in the quark magnetic moments. We show that there is a correlation between isospin symmetry breaking and data for the nucleon spin structure obtained from deep inelastic scattering. For small values of the isospin symmetry breaking, of the order of $5\%$, the magnetic moments and weak axial-vector form factors alone indicate a value of the spin polarization $ΔΣ\leq 0.20$. Larger values of the spin polarization are compatible only with large isospin symmetry breaking. We also calculate weak axial-vector form factors, which are independent of the symmetry breakings, from magnetic moment data and find good agreement with experiment.
Submitted 15 June, 1994; v1 submitted 22 November, 1993;
originally announced November 1993.