Showing 1–7 of 7 results for author: Duderstadt, B

  1. arXiv:2406.18587

    cs.CV cs.AI

    Nomic Embed Vision: Expanding the Latent Space

    Authors: Zach Nussbaum, Brandon Duderstadt, Andriy Mulyar

    Abstract: This technical report describes the training of nomic-embed-vision, a highly performant, open-code, open-weights image embedding model that shares the same latent space as nomic-embed-text. Together, nomic-embed-vision and nomic-embed-text form the first unified latent space to achieve high performance across vision, language, and multimodal tasks.

    Submitted 6 June, 2024; originally announced June 2024.

  2. arXiv:2406.11938

    cs.AI cs.MA

    Tracking the perspectives of interacting language models

    Authors: Hayden Helm, Brandon Duderstadt, Youngser Park, Carey E. Priebe

    Abstract: Large language models (LLMs) are capable of producing high quality information at unprecedented rates. As these models continue to entrench themselves in society, the content they produce will become increasingly pervasive in databases that are, in turn, incorporated into the pre-training data, fine-tuning data, retrieval data, etc. of other language models. In this paper we formalize the idea of…

    Submitted 17 June, 2024; originally announced June 2024.

  3. arXiv:2402.01613

    cs.CL cs.AI

    Nomic Embed: Training a Reproducible Long Context Text Embedder

    Authors: Zach Nussbaum, John X. Morris, Brandon Duderstadt, Andriy Mulyar

    Abstract: This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short and long-context tasks. We release the training code and model weights under an Apache 2 license. In contrast with other open-source m…

    Submitted 2 February, 2024; originally announced February 2024.

  4. arXiv:2311.04931

    cs.CL cs.AI

    GPT4All: An Ecosystem of Open Source Compressed Language Models

    Authors: Yuvanesh Anand, Zach Nussbaum, Adam Treat, Aaron Miller, Richard Guo, Ben Schmidt, GPT4All Community, Brandon Duderstadt, Andriy Mulyar

    Abstract: Large language models (LLMs) have recently achieved human-level performance on a range of professional and academic benchmarks. The accessibility of these models has lagged behind their performance. State-of-the-art LLMs require costly infrastructure; are only accessible via rate-limited, geo-locked, and censored web interfaces; and lack publicly available code and technical reports. In this paper…

    Submitted 6 November, 2023; originally announced November 2023.

    Comments: Accepted at NLP-OSS at EMNLP 2023

  5. arXiv:2305.05126

    cs.LG cs.AI stat.ME

    Comparing Foundation Models using Data Kernels

    Authors: Brandon Duderstadt, Hayden S. Helm, Carey E. Priebe

    Abstract: Recent advances in self-supervised learning and neural network scaling have enabled the creation of large models, known as foundation models, which can be easily adapted to a wide range of downstream tasks. The current paradigm for comparing foundation models involves evaluating them with aggregate metrics on various benchmark datasets. This method of model comparison is heavily dependent on the c…

    Submitted 7 January, 2024; v1 submitted 8 May, 2023; originally announced May 2023.

  6. arXiv:2011.06557

    stat.ML cs.LG stat.ME

    A partition-based similarity for classification distributions

    Authors: Hayden S. Helm, Ronak D. Mehta, Brandon Duderstadt, Weiwei Yang, Christoper M. White, Ali Geisa, Joshua T. Vogelstein, Carey E. Priebe

    Abstract: Herein we define a measure of similarity between classification distributions that is both principled from the perspective of statistical pattern recognition and useful from the perspective of machine learning practitioners. In particular, we propose a novel similarity on classification distributions, dubbed task similarity, that quantifies how an optimally-transformed optimal representation for a…

    Submitted 12 November, 2020; originally announced November 2020.

  7. arXiv:1803.03367

    q-bio.OT

    NeuroStorm: Accelerating Brain Science Discovery in the Cloud

    Authors: Gregory Kiar, Robert J. Anderson, Alex Baden, Alexandra Badea, Eric W. Bridgeford, Andrew Champion, Vikram Chandrashekhar, Forrest Collman, Brandon Duderstadt, Alan C. Evans, Florian Engert, Benjamin Falk, Tristan Glatard, William R. Gray Roncal, David N. Kennedy, Jeremy Maitin-Shepard, Ryan A. Marren, Onyeka Nnaemeka, Eric Perlman, Sharmishtaas Seshamani, Eric T. Trautman, Daniel J. Tward, Pedro Antonio Valdés-Sosa, Qing Wang, Michael I. Miller, et al. (2 additional authors not shown)

    Abstract: Neuroscientists are now able to acquire data at staggering rates across spatiotemporal scales. However, our ability to capitalize on existing datasets, tools, and intellectual capacities is hampered by technical challenges. The key barriers to accelerating scientific discovery correspond to the FAIR data principles: findability, global access to data, software interoperability, and reproducibility…

    Submitted 20 March, 2018; v1 submitted 8 March, 2018; originally announced March 2018.

    Comments: 10 pages, 4 figures, hackathon report