-
How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources
Authors:
Yizhong Wang,
Hamish Ivison,
Pradeep Dasigi,
Jack Hessel,
Tushar Khot,
Khyathi Raghavi Chandu,
David Wadden,
Kelsey MacMillan,
Noah A. Smith,
Iz Beltagy,
Hannaneh Hajishirzi
Abstract:
In this work we explore recent advances in instruction-tuning language models on a range of open instruction-following datasets. Despite recent claims that open models can be on par with state-of-the-art proprietary models, these claims are often accompanied by limited evaluation, making it difficult to compare models across the board and determine the utility of various resources. We provide a large set of instruction-tuned models from 6.7B to 65B parameters in size, trained on 12 instruction datasets ranging from manually curated (e.g., OpenAssistant) to synthetic and distilled (e.g., Alpaca) and systematically evaluate them on their factual knowledge, reasoning, multilinguality, coding, and open-ended instruction following abilities through a collection of automatic, model-based, and human-based metrics. We further introduce Tülu, our best performing instruction-tuned model suite finetuned on a combination of high-quality open resources. Our experiments show that different instruction-tuning datasets can uncover or enhance specific skills, while no single dataset (or combination) provides the best performance across all evaluations. Interestingly, we find that model and human preference-based evaluations fail to reflect differences in model capabilities exposed by benchmark-based evaluations, suggesting the need for the type of systemic evaluation performed in this work. Our evaluations show that the best model in any given evaluation reaches on average 87% of ChatGPT performance, and 73% of GPT-4 performance, suggesting that further investment in building better base models and instruction-tuning data is required to close the gap. We release our instruction-tuned models, including a fully finetuned 65B Tülu, along with our code, data, and evaluation framework at https://github.com/allenai/open-instruct to facilitate future research.
Submitted 30 October, 2023; v1 submitted 7 June, 2023;
originally announced June 2023.
-
The Semantic Reader Project: Augmenting Scholarly Documents through AI-Powered Interactive Reading Interfaces
Authors:
Kyle Lo,
Joseph Chee Chang,
Andrew Head,
Jonathan Bragg,
Amy X. Zhang,
Cassidy Trier,
Chloe Anastasiades,
Tal August,
Russell Authur,
Danielle Bragg,
Erin Bransom,
Isabel Cachola,
Stefan Candra,
Yoganand Chandrasekhar,
Yen-Sung Chen,
Evie Yu-Yen Cheng,
Yvonne Chou,
Doug Downey,
Rob Evans,
Raymond Fok,
Fangzhou Hu,
Regan Huff,
Dongyeop Kang,
Tae Soo Kim,
Rodney Kinney, et al. (30 additional authors not shown)
Abstract:
Scholarly publications are key to the transfer of knowledge from scholars to others. However, research papers are information-dense, and as the volume of the scientific literature grows, the need for new technology to support the reading process grows. In contrast to the process of finding papers, which has been transformed by Internet technology, the experience of reading research papers has changed little in decades. The PDF format for sharing research papers is widely used due to its portability, but it has significant downsides including: static content, poor accessibility for low-vision readers, and difficulty reading on mobile devices. This paper explores the question "Can recent advances in AI and HCI power intelligent, interactive, and accessible reading interfaces -- even for legacy PDFs?" We describe the Semantic Reader Project, a collaborative effort across multiple institutions to explore automatic creation of dynamic reading interfaces for research papers. Through this project, we've developed ten research prototype interfaces and conducted usability studies with more than 300 participants and real-world users showing improved reading experiences for scholars. We've also released a production reading interface for research papers that will incorporate the best features as they mature. We structure this paper around challenges scholars and the public face when reading research papers -- Discovery, Efficiency, Comprehension, Synthesis, and Accessibility -- and present an overview of our progress and remaining open challenges.
Submitted 23 April, 2023; v1 submitted 24 March, 2023;
originally announced March 2023.
-
The Semantic Scholar Open Data Platform
Authors:
Rodney Kinney,
Chloe Anastasiades,
Russell Authur,
Iz Beltagy,
Jonathan Bragg,
Alexandra Buraczynski,
Isabel Cachola,
Stefan Candra,
Yoganand Chandrasekhar,
Arman Cohan,
Miles Crawford,
Doug Downey,
Jason Dunkelberger,
Oren Etzioni,
Rob Evans,
Sergey Feldman,
Joseph Gorney,
David Graham,
Fangzhou Hu,
Regan Huff,
Daniel King,
Sebastian Kohlmeier,
Bailey Kuehl,
Michael Langan,
Daniel Lin, et al. (23 additional authors not shown)
Abstract:
The volume of scientific output is creating an urgent need for automated tools to help scientists keep up with developments in their field. Semantic Scholar (S2) is an open data platform and website aimed at accelerating science by helping scholars discover and understand scientific literature. We combine public and proprietary data sources using state-of-the-art techniques for scholarly PDF content extraction and automatic knowledge graph construction to build the Semantic Scholar Academic Graph, the largest open scientific literature graph to date, with 200M+ papers, 80M+ authors, 550M+ paper-authorship edges, and 2.4B+ citation edges. The graph includes advanced semantic features such as structurally parsed text, natural language summaries, and vector embeddings. In this paper, we describe the components of the S2 data processing pipeline and the associated APIs offered by the platform. We will update this living document to reflect changes as we add new data offerings and improve existing services.
Submitted 24 January, 2023;
originally announced January 2023.
-
A Comparative Analysis of Ookla Speedtest and Measurement Labs Network Diagnostic Test (NDT7)
Authors:
Kyle MacMillan,
Tarun Mangla,
James Saxon,
Nicole P. Marwell,
Nick Feamster
Abstract:
Consumers, regulators, and ISPs all use client-based "speed tests" to measure network performance, both in single-user settings and in aggregate. Two prevalent speed tests, Ookla's Speedtest and Measurement Lab's Network Diagnostic Test (NDT), are often used for similar purposes, despite having significant differences in both the test design and implementation, and in the infrastructure used to perform measurements. In this paper, we present the first-ever comparative evaluation of Ookla and NDT7 (the latest version of NDT), both in controlled and wide-area settings. Our goal is to characterize when and to what extent these two speed tests yield different results, as well as the factors that contribute to the differences. To study the effects of the test design, we conduct a series of controlled, in-lab experiments under a comprehensive set of network conditions and usage modes (e.g., TCP congestion control, native vs. browser client). Our results show that Ookla and NDT7 report similar speeds under most in-lab conditions, with the exception of networks that experience high latency, where Ookla consistently reports higher throughput. To characterize the behavior of these tools in wide-area deployment, we collect more than 80,000 pairs of Ookla and NDT7 measurements across nine months and 126 households, with a range of ISPs and speed tiers. This first-of-its-kind paired-test analysis reveals many previously unknown systemic issues, including high variability in NDT7 test results and systematically under-performing servers in the Ookla network.
Submitted 25 January, 2023; v1 submitted 24 May, 2022;
originally announced May 2022.
-
Measuring the Consolidation of DNS and Web Hosting Providers
Authors:
Synthia Wang,
Kyle MacMillan,
Brennan Schaffner,
Nick Feamster,
Marshini Chetty
Abstract:
Despite the Internet's continued growth, it increasingly depends on a small set of service providers to support Domain Name System (DNS) and web content hosting. This trend poses many potential threats including susceptibility to outages, failures, and potential censorship by providers. This paper aims to quantify consolidation in terms of popular domains' reliance on a small set of organizations for both DNS and web hosting. We highlight the extent to which a set of relatively few platforms host the authoritative name servers and web content for the top million websites. Our results show that both DNS and web hosting are concentrated, with Cloudflare and Amazon hosting over $30\%$ of the domains for both services. With the addition of Akamai, Fastly, and Google, these five organizations host $60\%$ of index pages in the Tranco top 10K, as well as the majority of external page resources. These trends are consistent across six different global vantage points, indicating that consolidation is happening globally and popular organizations can influence users' online experience across the world.
Submitted 30 January, 2024; v1 submitted 28 October, 2021;
originally announced October 2021.
-
Measuring the Performance and Network Utilization of Popular Video Conferencing Applications
Authors:
Kyle MacMillan,
Tarun Mangla,
James Saxon,
Nick Feamster
Abstract:
Video conferencing applications (VCAs) have become a critical Internet application, even more so during the COVID-19 pandemic, as users worldwide now rely on them for work, school, and telehealth. It is thus increasingly important to understand the resource requirements of different VCAs and how they perform under different network conditions, including: how much speed (upstream and downstream throughput) a VCA needs to support high quality of experience; how VCAs perform under temporary reductions in available capacity; how they compete with themselves, with each other, and with other applications; and how usage modality (e.g., number of participants) affects utilization. We study three modern VCAs: Zoom, Google Meet, and Microsoft Teams. Answers to these questions differ substantially depending on VCA. First, the average utilization on an unconstrained link varies between 0.8 Mbps and 1.9 Mbps. Given temporary reduction of capacity, some VCAs can take as long as 50 seconds to recover to steady state. Differences in proprietary congestion control algorithms also result in unfair bandwidth allocations: in constrained bandwidth settings, one Zoom video conference can consume more than 75% of the available bandwidth when competing with another VCA (e.g., Meet, Teams). For some VCAs, client utilization can decrease as the number of participants increases, due to the reduced video resolution of each participant's video stream given a larger number of participants. Finally, one participant's viewing mode (e.g., pinning a speaker) can affect the upstream utilization of other participants.
Submitted 27 May, 2021;
originally announced May 2021.
-
Evaluating Snowflake as an Indistinguishable Censorship Circumvention Tool
Authors:
Kyle MacMillan,
Jordan Holland,
Prateek Mittal
Abstract:
Tor is the most well-known tool for circumventing censorship. Unfortunately, Tor traffic has been shown to be detectable using deep-packet inspection. WebRTC is a popular web framework that enables browser-to-browser connections. Snowflake is a novel pluggable transport that leverages WebRTC to connect Tor clients to the Tor network. In theory, Snowflake was created to be indistinguishable from other WebRTC services. In this paper, we evaluate the indistinguishability of Snowflake. We collect over 6,500 DTLS handshakes from Snowflake, Facebook Messenger, Google Hangouts, and Discord WebRTC connections and show that Snowflake is identifiable among these applications with 100% accuracy. We show that several features, including the extensions offered and the number of packets in the handshake, distinguish Snowflake among these services. Finally, we suggest recommendations for improving identification resistance in Snowflake. We have made the dataset publicly available.
Submitted 14 October, 2020; v1 submitted 23 July, 2020;
originally announced August 2020.
-
Primary Pseudoperfect Numbers, Arithmetic Progressions, and the Erdős-Moser Equation
Authors:
Jonathan Sondow,
Kieren MacMillan
Abstract:
A primary pseudoperfect number (PPN) is an integer $K > 1$ such that the reciprocals of $K$ and its prime factors sum to 1. PPNs arise in studying perfectly weighted graphs and singularities of algebraic surfaces, and are related to Sylvester's sequence, Giuga numbers, Znám's problem, the inheritance problem, and Curtiss's bound on solutions of a unit fraction equation.
Here we show $K \equiv 6 \pmod{6^2}$ if $6\mid K$, and uncover a remarkable $7$-term arithmetic progression of residues modulo $6^2\cdot8$ in the sequence of known PPNs. On that basis, we pose a conjecture which leads to a conditional proof of the new record lower bound $k>10^{3.99\times10^{20}}$ on any non-trivial solution to the Erdős-Moser Diophantine equation $1^n + 2^n + \dotsb + k^n = (k+1)^n$.
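The definition and the congruence stated in this abstract can be checked numerically for small cases. The following is a minimal sketch, not code from the paper: the helper names `prime_factors` and `is_ppn` are hypothetical, and the search bound of 2000 is an arbitrary choice for illustration.

```python
from fractions import Fraction


def prime_factors(k):
    """Return the distinct prime factors of k, found by trial division."""
    factors = []
    p = 2
    while p * p <= k:
        if k % p == 0:
            factors.append(p)
            while k % p == 0:
                k //= p
        p += 1
    if k > 1:
        factors.append(k)
    return factors


def is_ppn(k):
    """True if 1/k plus the reciprocals of k's prime factors sums to exactly 1."""
    if k <= 1:
        return False
    total = Fraction(1, k) + sum(Fraction(1, p) for p in prime_factors(k))
    return total == 1


# Exhaustive search over a small range (exact rational arithmetic, no rounding).
small_ppns = [k for k in range(2, 2000) if is_ppn(k)]
print(small_ppns)  # [2, 6, 42, 1806]

# The next known PPN, checked directly rather than by search.
print(is_ppn(47058))  # True

# Residues modulo 6^2 for the PPNs above that are divisible by 6.
print([k % 36 for k in small_ppns + [47058] if k % 6 == 0])  # [6, 6, 6, 6]
```

Using `Fraction` keeps the sum exact, so the equality test against 1 is not affected by floating-point error.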
Submitted 16 December, 2018;
originally announced December 2018.
-
Topic supervised non-negative matrix factorization
Authors:
Kelsey MacMillan,
James D. Wilson
Abstract:
Topic models have been extensively used to organize and interpret the contents of large, unstructured corpora of text documents. Although topic models often perform well on traditional training vs. test set evaluations, it is often the case that the results of a topic model do not align with human interpretation. This interpretability fallacy is largely due to the unsupervised nature of topic models, which prohibits any user guidance on the results of a model. In this paper, we introduce a semi-supervised method called topic supervised non-negative matrix factorization (TS-NMF) that enables the user to provide labeled example documents to promote the discovery of more meaningful semantic structure of a corpus. In this way, the results of TS-NMF better match the intuition and desired labeling of the user. The core of TS-NMF relies on solving a non-convex optimization problem for which we derive an iterative algorithm that is shown to be monotonic and convergent to a local optimum. We demonstrate the practical utility of TS-NMF on the Reuters and PubMed corpora, and find that TS-NMF is especially useful for conceptual or broad topics, where topic key terms are not well understood. Although identifying an optimal latent structure for the data is not a primary objective of the proposed approach, we find that TS-NMF achieves higher weighted Jaccard similarity scores than the contemporary methods, (unsupervised) NMF and latent Dirichlet allocation, at supervision rates as low as 10% to 20%.
Submitted 2 July, 2017; v1 submitted 12 June, 2017;
originally announced June 2017.
-
Reducing the Erdos-Moser equation 1^n + 2^n + . . . + k^n = (k+1)^n modulo k and k^2
Authors:
Jonathan Sondow,
Kieren MacMillan
Abstract:
An open conjecture of Erdos and Moser is that the only solution of the Diophantine equation in the title is the trivial solution 1+2=3. Reducing the equation modulo k and k^2, we give necessary and sufficient conditions on solutions to the resulting congruence and supercongruence. A corollary is a new proof of Moser's result that the conjecture is true for odd exponents n. We also connect solutions k of the congruence to primary pseudoperfect numbers and to a result of Zagier. The proofs use divisibility properties of power sums as well as Lerch's relation between Fermat and Wilson quotients.
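As a small illustration of the equation in the title, a brute-force search over a modest range recovers only the trivial solution 1+2=3. This is a sketch for orientation only, not the paper's method; the function name `erdos_moser_solutions` and the search bounds are assumptions chosen for the example.

```python
def erdos_moser_solutions(max_n, max_k):
    """List all (n, k) with 1^n + 2^n + ... + k^n == (k+1)^n in the given range."""
    solutions = []
    for n in range(1, max_n + 1):
        power_sum = 0
        for k in range(1, max_k + 1):
            power_sum += k ** n  # running value of 1^n + ... + k^n
            if power_sum == (k + 1) ** n:
                solutions.append((n, k))
    return solutions


print(erdos_moser_solutions(max_n=20, max_k=200))  # [(1, 2)]
```

A finite search like this proves nothing about the conjecture; it only shows what the trivial solution looks like in the (n, k) notation of the title.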
Submitted 9 November, 2010;
originally announced November 2010.
-
Proofs of power sum and binomial coefficient congruences via Pascal's identity
Authors:
Kieren MacMillan,
Jonathan Sondow
Abstract:
A frequently cited theorem says that for n > 0 and prime p, the sum of the first p n-th powers is congruent to -1 modulo p if p-1 divides n, and to 0 otherwise. We survey the main ingredients in several known proofs. Then we give an elementary proof, using an identity for power sums proven by Pascal in 1654. An application is a simple proof of a congruence for certain sums of binomial coefficients, due to Hermite and Bachmann.
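The theorem stated at the start of this abstract is easy to spot-check numerically. The sketch below is an illustration, not the paper's proof; the helper names and the ranges (primes below 50, exponents up to 40) are arbitrary assumptions.

```python
def is_prime(m):
    """Trial-division primality test, adequate for small m."""
    if m < 2:
        return False
    d = 2
    while d * d <= m:
        if m % d == 0:
            return False
        d += 1
    return True


def power_sum_mod_p(n, p):
    """Residue of 1^n + 2^n + ... + p^n modulo p."""
    return sum(pow(j, n, p) for j in range(1, p + 1)) % p


def expected_residue(n, p):
    """Residue predicted by the theorem: -1 (i.e. p-1) if (p-1) divides n, else 0."""
    return p - 1 if n % (p - 1) == 0 else 0


primes = [p for p in range(2, 50) if is_prime(p)]
mismatches = [
    (n, p)
    for p in primes
    for n in range(1, 41)
    if power_sum_mod_p(n, p) != expected_residue(n, p)
]
print(mismatches)  # []
```

The three-argument form of `pow` performs modular exponentiation, so the intermediate values stay small even for larger exponents.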
Submitted 30 October, 2010;
originally announced November 2010.
-
Divisibility of Power Sums and the Generalized Erdos-Moser Equation
Authors:
Kieren MacMillan,
Jonathan Sondow
Abstract:
Using elementary methods, we determine the highest power of 2 dividing a power sum 1^n + 2^n + . . . + m^n, generalizing Lengyel's formula for the case where m is itself a power of 2. An application is a simple proof of Moree's result that, if (a,m,n) is any solution of the generalized Erdos-Moser Diophantine equation 1^n + 2^n + . . . + (m-1)^n = am^n, then m is odd.
Submitted 19 May, 2011; v1 submitted 11 October, 2010;
originally announced October 2010.