Search | arXiv e-print repository

Scaling Laws for Fine-Grained Mixture of Experts

Authors: Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygóźdź, Piotr Sankowski, Marek Cygan, Sebastian Jaszczur

Abstract: Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the optimal training configuration for a given computational budget. Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget. Furthermore, we demonstrate that the common practice of setting the size of experts in MoE to mirror the feed-forward layer is not optimal at almost any computational budget.

Submitted 12 February, 2024; originally announced February 2024.

arXiv:2401.04081 [pdf, other]

MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts

Authors: Maciej Pióro, Kamil Ciebiera, Krystian Król, Jan Ludziejewski, Michał Krutul, Jakub Krajewski, Szymon Antoniak, Piotr Miłoś, Marek Cygan, Sebastian Jaszczur

Abstract: State Space Models (SSMs) have become serious contenders in the field of sequential modeling, challenging the dominance of Transformers. At the same time, Mixture of Experts (MoE) has significantly improved Transformer-based Large Language Models, including recent state-of-the-art open models. We propose that to unlock the potential of SSMs for scaling, they should be combined with MoE. We showcase this on Mamba, a recent SSM-based model that achieves remarkable performance. Our model, MoE-Mamba, outperforms both Mamba and baseline Transformer-MoE. In particular, MoE-Mamba reaches the same performance as Mamba in $2.35\times$ fewer training steps while preserving the inference performance gains of Mamba against Transformer.

Submitted 26 February, 2024; v1 submitted 8 January, 2024; originally announced January 2024.

arXiv:2310.15961 [pdf, other]

Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation

Authors: Szymon Antoniak, Sebastian Jaszczur, Michał Krutul, Maciej Pióro, Jakub Krajewski, Jan Ludziejewski, Tomasz Odrzygóźdź, Marek Cygan

Abstract: Despite the promise of Mixture of Experts (MoE) models in increasing parameter counts of Transformer models while maintaining training and inference costs, their application carries notable drawbacks. The key strategy of these models is to, for each processed token, activate at most a few experts - subsets of an extensive feed-forward layer. But this approach is not without its challenges. The operation of matching experts and tokens is discrete, which makes MoE models prone to issues like training instability and uneven expert utilization. Existing techniques designed to address these concerns, such as auxiliary losses or balance-aware matching, either result in lower model performance or are more difficult to train. In response to these issues, we propose Mixture of Tokens, a fully-differentiable model that retains the benefits of MoE architectures while avoiding the aforementioned difficulties. Rather than routing tokens to experts, this approach mixes tokens from different examples prior to feeding them to experts, enabling the model to learn from all token-expert combinations. Importantly, this mixing can be disabled to avoid mixing of different sequences during inference. Crucially, this method is fully compatible with both masked and causal Large Language Model training and inference.

Submitted 24 October, 2023; originally announced October 2023.

arXiv:2310.08995 [pdf, ps, other]

doi 10.1051/0004-6361/202346191

Scaling slowly rotating asteroids by stellar occultations

Authors: A. Marciniak, J. Ďurech, A. Choukroun, J. Hanuš, W. Ogłoza, R. Szakáts, L. Molnár, A. Pál, F. Monteiro, E. Frappa, W. Beisker, H. Pavlov, J. Moore, R. Adomavičienė, R. Aikawa, S. Andersson, P. Antonini, Y. Argentin, A. Asai, P. Assoignon, J. Barton, P. Baruffetti, K. L. Bath, R. Behrend, L. Benedyktowicz , et al. (154 additional authors not shown)

Abstract: As evidenced by recent survey results, the majority of asteroids are slow rotators (P>12 h), but lack spin and shape models due to selection bias. This bias is skewing our overall understanding of the spins, shapes, and sizes of asteroids, as well as of their other properties. Also, diameter determinations for large (>60 km) and medium-sized asteroids (between 30 and 60 km) often vary by over 30% for multiple reasons. Our long-term project is focused on a few tens of slow rotators with periods of up to 60 hours. We aim to obtain their full light curves and reconstruct their spins and shapes. We also precisely scale the models, typically with an accuracy of a few percent. We used wide sets of dense light curves for spin and shape reconstructions via light-curve inversion. Precisely scaling them with thermal data was not possible here because of poor infrared data: large bodies are too bright for the WISE mission. Therefore, we recently launched a campaign among stellar occultation observers, to scale these models and to verify the shape solutions, often allowing us to break the mirror pole ambiguity. The presented scheme resulted in shape models for 16 slow rotators, most of them for the first time. Fitting them to stellar occultations resolved previous inconsistencies in size determinations. For around half of the targets, this fitting also allowed us to identify a clearly preferred pole solution, thus removing the ambiguity inherent to light-curve inversion. We also address the influence of the uncertainty of the shape models on the derived diameters.
Overall, our project has already provided reliable models for around 50 slow rotators. Such well-determined and scaled asteroid shapes will, e.g., constitute a solid basis for density determinations when coupled with mass information. Spin and shape models continue to fill the gaps caused by various biases.

Submitted 13 October, 2023; originally announced October 2023.

Comments: Accepted to Astronomy & Astrophysics. 12 pages + appendices

Journal ref: A&A 679, A60 (2023)

arXiv:2208.01369 [pdf, other]

The Face of Affective Disorders

Authors: Christian S. Pilz, Benjamin Clemens, Inka C. Hiss, Christoph Weiss, Ulrich Canzler, Jarek Krajewski, Ute Habel, Steffen Leonhardt

Abstract: We study the statistical properties of facial behaviour altered by the regulation of brain arousal in the clinical domain of psychiatry. The underlying mechanism is linked to the empirical interpretation of the vigilance continuum as a behavioral surrogate measurement for certain states of mind. Referring to the classical scalp-based obtrusive measurements, we name the presented method Opto-Electronic Encephalography (OEG), which solely relies on modern camera-based real-time signal processing and computer vision. Based upon a stochastic representation as coherence of the face dynamics, reflecting the hemifacial asymmetry in emotion expressions, we demonstrate an almost flawless distinction between patients and healthy controls as well as between the mental disorders depression and schizophrenia and the symptom severity. In contrast to the standard diagnostic process, which is time-consuming, subjective and does not incorporate neurobiological data such as real-time face dynamics, the objective stochastic modeling of the affective responsiveness only requires a few minutes of video-based facial recordings. We also highlight the potential of the methodology as a causal inference model in transdiagnostic analysis to predict the outcome of pharmacological treatment. All results are obtained on a clinical longitudinal data collection comprising 99 patients and 43 controls.

Submitted 5 September, 2022; v1 submitted 2 August, 2022; originally announced August 2022.

Comments: 15 pages. Submitted for peer review to the IEEE Transactions on Affective Computing

Report number: rev-2.11-2022

arXiv:2109.00463 [pdf, ps, other]

doi 10.1051/0004-6361/202140991

Properties of slowly rotating asteroids from the Convex Inversion Thermophysical Model

Authors: A. Marciniak, J. Ďurech, V. Alí-Lagoa, W. Ogłoza, R. Szakáts, T. G. Müller, L. Molnár, A. Pál, F. Monteiro, P. Arcoverde, R. Behrend, Z. Benkhaldoun, L. Bernasconi, J. Bosch, S. Brincat, L. Brunetto, M. Butkiewicz - Bąk, F. Del Freo, R. Duffard, M. Evangelista-Santana, G. Farroni, S. Fauvaud, M. Fauvaud, M. Ferrais, S. Geier , et al. (51 additional authors not shown)

Abstract: Results from the TESS mission showed that previous studies strongly underestimated the number of slow rotators, revealing the importance of studying those asteroids. For most slowly rotating asteroids (P > 12 h), no spin and shape model is available because of observation selection effects. This hampers determination of their thermal parameters and accurate sizes. We continue our campaign to minimise selection effects among main belt asteroids. Our targets are slow rotators with low light-curve amplitudes. The goal is to provide their scaled spin and shape models together with thermal inertia, albedo, and surface roughness to complete the statistics. Rich multi-apparition datasets of dense light curves are supplemented with data from Kepler and TESS. In addition to data in the visible range, we also use thermal data from infrared space observatories (IRAS, Akari and WISE) in a combined optimisation process using the Convex Inversion Thermophysical Model (CITPM). This novel method has so far been applied to only a few targets, and in this work we further validate the method. We present the models of 16 slow rotators. All provide good fits to both thermal and visible data. The obtained sizes are on average accurate at the 5% level, with diameters in the range from 25 to 145 km. The rotation periods of our targets range from 11 to 59 hours, and the thermal inertia covers a wide range of values, from 2 to <400 SI units, not showing any correlation with the period.
With this work we increase the sample of slow rotators with reliable spin and shape models and known thermal inertia by 40%. The thermal inertia values of our sample do not display a previously suggested increasing trend with rotation period, which might be due to their small skin depth.

Submitted 1 September, 2021; originally announced September 2021.

Comments: Accepted to Astronomy & Astrophysics. 10 pages + appendices

Journal ref: A&A 654, A87 (2021)

arXiv:2005.14155 [pdf, other]

doi 10.3847/1538-3881/ab8bcc

Demonstrating high-precision photometry with a CubeSat: ASTERIA observations of 55 Cancri e

Authors: Mary Knapp, Sara Seager, Brice-Olivier Demory, Akshata Krishnamurthy, Matthew W. Smith, Christopher M. Pong, Vanessa P. Bailey, Amanda Donner, Peter Di Pasquale, Brian Campuzano, Colin Smith, Jason Luu, Alessandra Babuscia, Robert L. Bocchino, Jr., Jessica Loveland, Cody Colley, Tobias Gedenk, Tejas Kulkarni, Kyle Hughes, Mary White, Joel Krajewski, Lorraine Fesq

Abstract: ASTERIA (Arcsecond Space Telescope Enabling Research In Astrophysics) is a 6U CubeSat space telescope (10 cm x 20 cm x 30 cm, 10 kg). ASTERIA's primary mission objective was demonstrating two key technologies for reducing systematic noise in photometric observations: high-precision pointing control and high-stability thermal control. ASTERIA demonstrated 0.5 arcsecond RMS pointing stability and $\pm$10 milliKelvin thermal control of its camera payload during its primary mission, a significant improvement in pointing and thermal performance compared to other spacecraft in ASTERIA's size and mass class. ASTERIA launched in August 2017 and was deployed from the International Space Station (ISS) in November 2017. During the prime mission (November 2017 – February 2018) and the first extended mission that followed (March 2018 – May 2018), ASTERIA conducted opportunistic science observations which included collection of photometric data on 55 Cancri, a nearby exoplanetary system with a super-Earth transiting planet. The 55 Cancri data were reduced using a custom pipeline to correct CMOS detector column-dependent gain variations. A Markov Chain Monte Carlo (MCMC) approach was used to simultaneously detrend the photometry using a simple baseline model and fit a transit model. ASTERIA made a marginal detection of the known transiting exoplanet 55 Cancri e ($\sim2\,R_\oplus$), measuring a transit depth of $374\pm170$ ppm. This is the first detection of an exoplanet transit by a CubeSat.
The successful detection of super-Earth 55 Cancri e demonstrates that small, inexpensive spacecraft can deliver high-precision photometric measurements.

Submitted 28 May, 2020; originally announced May 2020.

Comments: 23 pages, 9 figures. Accepted in AJ

arXiv:cond-mat/0112022 [pdf]

doi 10.1103/PhysRevLett.89.067202

Multiple Field-Induced Phase Transitions in a Geometrically-Frustrated Dipolar Magnet - Gd2Ti2O7

Authors: A. P. Ramirez, B. S. Shastry, A. Hayashi, J. J. Krajewski, D. A. Huse, R. J. Cava

Abstract: Field-driven phase transitions generally arise from competition between Zeeman energy and exchange or crystal-field anisotropy. Here we present the phase diagram of a frustrated pyrochlore magnet Gd2Ti2O7, where crystal field splitting is small compared to the dipolar energy. We find good agreement between zero-temperature critical fields and those obtained from a mean-field model. Here, dipolar interactions couple real-space and spin-space, so the transitions in Gd2Ti2O7 arise from field-induced "cooperative anisotropy" reflecting the broken spatial symmetries of the pyrochlore lattice.

Submitted 3 December, 2001; originally announced December 2001.

Comments: 10 pages, 5 figures; PDF file attached. PACS 75.30.Kz, 75.50.Ee, 75.10.-b

Journal ref: Phys. Rev. Lett. 89, 067202 (2002).

arXiv:supr-con/9505002 [pdf, ps, other]

doi 10.1016/0921-4534(95)00312-6

Neutron Scattering Study of Crystal Field Energy Levels and Field Dependence of the Magnetic Order in Superconducting HoNi2B2C

Authors: T. E. Grigereit, J. W. Lynn, R. J. Cava, J. J. Krajewski, W. F. Peck, Jr.

Abstract: Elastic and inelastic neutron scattering measurements have been carried out to investigate the magnetic properties of superconducting (Tc~8K) HoNi2B2C. The inelastic measurements reveal that the lowest two crystal field transitions out of the ground state occur at 11.28(3) and 16.00(2) meV, while the transition of 4.70(9) meV between these two levels is observed at elevated temperatures. The temperature dependence of the intensities of these transitions is consistent with both the ground state and these higher levels being magnetic doublets. The system becomes magnetically long range ordered below 8K, and since this ordering energy kTN ~ 0.69meV << 11.28meV the magnetic properties in the ordered phase are dominated by the ground-state spin dynamics only. The low temperature structure, which coexists with superconductivity, consists of ferromagnetic sheets of Ho3+ moments in the a-b plane, with the sheets coupled antiferromagnetically along the c-axis. The magnetic state that initially forms on cooling, however, is dominated by an incommensurate spiral antiferromagnetic state along the c-axis, with wave vector qc ~0.054 A-1, in which these ferromagnetic sheets are canted from their low temperature antiparallel configuration by ~17 deg. The intensity for this spiral state reaches a maximum near the reentrant superconducting transition at ~5K; the spiral state then collapses at lower temperature in favor of the commensurate antiferromagnetic state.
We have investigated the field dependence of the magnetic order at and above this reentrant superconducting transition. Initially the field rotates the powder particles to align the a-b plane along the field direction, demonstrating that the moments strongly prefer to lie within this plane due to the crystal field anisotropy. Upon subsequently increasing the field at

Submitted 23 May, 1995; originally announced May 1995.

Comments: RevTex, 7 pages, 11 figures (available upon request); Physica C

Showing 1–9 of 9 results for author: Krajewski, J