-
Scaling Laws for Fine-Grained Mixture of Experts
Authors:
Jakub Krajewski,
Jan Ludziejewski,
Kamil Adamczewski,
Maciej Pióro,
Michał Krutul,
Szymon Antoniak,
Kamil Ciebiera,
Krystian Król,
Tomasz Odrzygóźdź,
Piotr Sankowski,
Marek Cygan,
Sebastian Jaszczur
Abstract:
Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models. In this work, we analyze their scaling properties, incorporating an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity. Leveraging these laws, we derive the optimal training configuration for a given computational budget. Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget. Furthermore, we demonstrate that the common practice of setting the size of experts in MoE to mirror the feed-forward layer is not optimal at almost any computational budget.
Submitted 12 February, 2024;
originally announced February 2024.
-
Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation
Authors:
Szymon Antoniak,
Sebastian Jaszczur,
Michał Krutul,
Maciej Pióro,
Jakub Krajewski,
Jan Ludziejewski,
Tomasz Odrzygóźdź,
Marek Cygan
Abstract:
Despite the promise of Mixture of Experts (MoE) models in increasing parameter counts of Transformer models while maintaining training and inference costs, their application carries notable drawbacks. The key strategy of these models is to activate, for each processed token, at most a few experts, which are subsets of an extensive feed-forward layer. But this approach is not without its challenges. The operation of matching experts and tokens is discrete, which makes MoE models prone to issues like training instability and uneven expert utilization. Existing techniques designed to address these concerns, such as auxiliary losses or balance-aware matching, either lower model performance or make models more difficult to train. In response to these issues, we propose Mixture of Tokens, a fully-differentiable model that retains the benefits of MoE architectures while avoiding the aforementioned difficulties. Rather than routing tokens to experts, this approach mixes tokens from different examples prior to feeding them to experts, enabling the model to learn from all token-expert combinations. Importantly, this mixing can be disabled to avoid mixing of different sequences during inference. Crucially, this method is fully compatible with both masked and causal Large Language Model training and inference.
Submitted 24 October, 2023;
originally announced October 2023.
-
Fast and Precise: Adjusting Planning Horizon with Adaptive Subgoal Search
Authors:
Michał Zawalski,
Michał Tyrolski,
Konrad Czechowski,
Tomasz Odrzygóźdź,
Damian Stachura,
Piotr Piękos,
Yuhuai Wu,
Łukasz Kuciński,
Piotr Miłoś
Abstract:
Complex reasoning problems contain states that vary in the computational cost required to determine a good action plan. Taking advantage of this property, we propose Adaptive Subgoal Search (AdaSubS), a search method that adaptively adjusts the planning horizon. To this end, AdaSubS generates diverse sets of subgoals at different distances. A verification mechanism is employed to filter out unreachable subgoals swiftly, allowing the search to focus on feasible, more distant subgoals. In this way, AdaSubS benefits from the efficiency of planning with longer subgoals and the fine control of shorter ones, and thus scales well to difficult planning problems. We show that AdaSubS significantly surpasses hierarchical planning algorithms on three complex reasoning tasks: Sokoban, the Rubik's Cube, and the inequality-proving benchmark INT.
Submitted 25 May, 2024; v1 submitted 1 June, 2022;
originally announced June 2022.
-
Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers
Authors:
Albert Q. Jiang,
Wenda Li,
Szymon Tworkowski,
Konrad Czechowski,
Tomasz Odrzygóźdź,
Piotr Miłoś,
Yuhuai Wu,
Mateja Jamnik
Abstract:
In theorem proving, the task of selecting useful premises from a large library to unlock the proof of a given conjecture is crucially important. This presents a challenge for all theorem provers, especially the ones based on language models, due to their relative inability to reason over huge volumes of premises in text form. This paper introduces Thor, a framework integrating language models and automated theorem provers to overcome this difficulty. In Thor, a class of methods called hammers that leverage the power of automated theorem provers are used for premise selection, while all other tasks are designated to language models. Thor increases a language model's success rate on the PISA dataset from $39\%$ to $57\%$, while solving $8.2\%$ of problems neither language models nor automated theorem provers are able to solve on their own. Furthermore, with a significantly smaller computational budget, Thor can achieve a success rate on the MiniF2F dataset that is on par with the best existing methods. Thor can be instantiated for the majority of popular interactive theorem provers via a straightforward protocol we provide.
Submitted 22 May, 2022;
originally announced May 2022.
-
Subgoal Search For Complex Reasoning Tasks
Authors:
Konrad Czechowski,
Tomasz Odrzygóźdź,
Marek Zbysiński,
Michał Zawalski,
Krzysztof Olejnik,
Yuhuai Wu,
Łukasz Kuciński,
Piotr Miłoś
Abstract:
Humans excel in solving complex reasoning tasks through a mental process of moving from one idea to a related one. Inspired by this, we propose the Subgoal Search (kSubS) method. Its key component is a learned subgoal generator that produces a diversity of subgoals that are both achievable and closer to the solution. Using subgoals reduces the search space and induces a high-level search graph suitable for efficient planning. In this paper, we implement kSubS using a transformer-based subgoal module coupled with the classical best-first search framework. We show that a simple approach of generating subgoals $k$ steps ahead is surprisingly efficient on three challenging domains: two popular puzzle games, Sokoban and the Rubik's Cube, and the inequality-proving benchmark INT. kSubS achieves strong results, including state-of-the-art on INT, within a modest computational budget.
Submitted 3 April, 2024; v1 submitted 25 August, 2021;
originally announced August 2021.
-
Nonplanar isoperimetric inequality for random groups
Authors:
Tomasz Odrzygóźdź
Abstract:
The goal of this note is to generalize the Isoperimetric Inequality for random groups to the class of non-planar diagrams with a bounded number of faces.
Submitted 25 April, 2021;
originally announced April 2021.
-
Bent walls for random groups in the square and hexagonal model
Authors:
Tomasz Odrzygóźdź
Abstract:
We consider two random group models: the hexagonal model and the square model, defined as the quotient of a free group by a random set of reduced words of length six and four respectively. Our first main result is that in the hexagonal model there exists a sharp density threshold for Kazhdan's Property (T), and it equals 1/3. Our second main result is that for densities < 3/8 a random group in the square model with overwhelming probability does not have Property (T). Moreover, we provide a new version of the Isoperimetric Inequality that concerns non-planar diagrams, and we introduce new geometrical tools to investigate random groups: trees of loops, diagrams collared by a tree of loops, and specific codimension one structures in the Cayley complex, called bent hypergraphs.
Submitted 22 June, 2019; v1 submitted 12 June, 2019;
originally announced June 2019.
-
Cubulating random groups in the square model
Authors:
Tomasz Odrzygóźdź
Abstract:
Our main result is that for densities $<\frac{3}{10}$ a random group in the square model has the Haagerup property and is residually finite. Moreover, we generalize the Isoperimetric Inequality to some class of non-planar diagrams and, using this, we introduce a system of modified hypergraphs providing the structure of a space with walls on the Cayley complex of a random group. Then we show that the natural action of a random group on this space with walls is proper, which gives a proper action of a random group on a CAT(0) cube complex.
Submitted 9 October, 2016;
originally announced October 2016.
-
The square model for random groups
Authors:
Tomasz Odrzygóźdź
Abstract:
We introduce a new random group model called the square model: we quotient a free group on $n$ generators by a random set of relations, each of which is a reduced word of length four. We prove, as in the Gromov density model, that for densities $> \frac{1}{2}$ a random group in the square model is trivial with overwhelming probability and for densities $<\frac{1}{2}$ a random group is hyperbolic with overwhelming probability. Moreover, we show that for densities $\frac{1}{4} < d < \frac{1}{3}$ a random group in the square model does not have Property (T). Inspired by the results for the triangular model, we prove that for densities $<\frac{1}{4}$ in the square model, a random group is free with overwhelming probability. We also introduce abstract diagrams with fixed edges and prove a generalization of the isoperimetric inequality.
Submitted 13 May, 2014; v1 submitted 9 May, 2014;
originally announced May 2014.
-
Half-page derivation of the Thomas precession
Authors:
Andrzej Dragan,
Tomasz Odrzygóźdź
Abstract:
Instantaneous derivation of the Thomas precession with only basic vector calculus.
Submitted 8 November, 2012;
originally announced November 2012.