Search | arXiv e-print repository

FlowFPX: Nimble Tools for Debugging Floating-Point Exceptions

Authors: Taylor Allred, Xinyi Li, Ashton Wiersdorf, Ben Greenman, Ganesh Gopalakrishnan

Abstract: Reliable numerical computations are central to scientific computing, but the floating-point arithmetic that enables large-scale models is error-prone. Numeric exceptions are a common occurrence and can propagate through code, leading to flawed results. This paper presents FlowFPX, a toolkit for systematically debugging floating-point exceptions by recording their flow, coalescing exception context… ▽ More Reliable numerical computations are central to scientific computing, but the floating-point arithmetic that enables large-scale models is error-prone. Numeric exceptions are a common occurrence and can propagate through code, leading to flawed results. This paper presents FlowFPX, a toolkit for systematically debugging floating-point exceptions by recording their flow, coalescing exception contexts, and fuzzing in select locations. These tools help scientists discover when exceptions happen and track down their origin, smoothing the way to a reliable codebase. △ Less

Submitted 22 March, 2024; originally announced March 2024.

Comments: Presented at JuliaCon 2023; to appear in JuliaCon proceedings

arXiv:2403.02409 [pdf]

doi 10.22152/programming-journal.org/2024/8/12

Privacy-Respecting Type Error Telemetry at Scale

Authors: Ben Greenman, Alan Jeffrey, Shriram Krishnamurthi, Mitesh Shah

Abstract: Context: Roblox Studio lets millions of creators build interactive experiences by programming in a variant of Lua called Luau. The creators form a broad group, ranging from novices writing their first script to professional developers; thus, Luau must support a wide audience. As part of its efforts to support all kinds of programmers, Luau includes an optional, gradual type system and goes to grea… ▽ More Context: Roblox Studio lets millions of creators build interactive experiences by programming in a variant of Lua called Luau. The creators form a broad group, ranging from novices writing their first script to professional developers; thus, Luau must support a wide audience. As part of its efforts to support all kinds of programmers, Luau includes an optional, gradual type system and goes to great lengths to minimize false positive errors. Inquiry: Since Luau is currently being used by many creators, we want to collect data to improve the language and, in particular, the type system. The standard way to collect data is to deploy client-side telemetry; however, we cannot scrape personal data or proprietary information, which means we cannot collect source code fragments, error messages, or even filepaths. The research questions are thus about how to conduct telemetry that is not invasive and obtain insights from it about type errors. Approach: We designed and implemented a pseudonymized, randomly-sampling telemetry system for Luau. Telemetry records include a timestamp, a session id, a reason for sending, and a numeric summary of the most recent type analyses. This information lets us study type errors over time without revealing private data. We deployed the system in Roblox Studio during Spring 2023 and collected over 1.5 million telemetry records from over 340,000 sessions. Knowledge: We present several findings about Luau, all of which suggest that telemetry is an effective way to study type error pragmatics. One of the less-surprising findings is that opt-in gradual types are unpopular: there is an 100x gap between the number of untyped Luau sessions and the number of typed ones. One surprise is that the strict mode for type analysis is overly conservative about interactions with data assets. A reassuring finding is that type analysis rarely hits its internal limits on problem size. Grounding: Our findings are supported by a dataset of over 1.5 million telemetry records. The data and scripts for analyzing it are available in an artifact. Importance: Beyond the immediate benefits to Luau, our findings about types and type errors have implications for adoption and ergonomics in other gradual languages such as TypeScript, Elixir, and Typed Racket. Our telemetry design is of broad interest, as it reports on type errors without revealing sensitive information. △ Less

Submitted 4 March, 2024; originally announced March 2024.

Journal ref: The Art, Science, and Engineering of Programming, 2024, Vol. 8, Issue 3, Article 12

arXiv:2401.00021 [pdf]

doi 10.22152/programming-journal.org/2024/8/7

Conceptual Mutation Testing for Student Programming Misconceptions

Authors: Siddhartha Prasad, Ben Greenman, Tim Nelson, Shriram Krishnamurthi

Abstract: Context: Students often misunderstand programming problem descriptions. This can lead them to solve the wrong problem, which creates frustration, obstructs learning, and imperils grades. Researchers have found that students can be made to better understand the problem by writing examples before they start programming. These examples are checked against correct and wrong implementations -- analogou… ▽ More Context: Students often misunderstand programming problem descriptions. This can lead them to solve the wrong problem, which creates frustration, obstructs learning, and imperils grades. Researchers have found that students can be made to better understand the problem by writing examples before they start programming. These examples are checked against correct and wrong implementations -- analogous to mutation testing -- provided by course staff. Doing so results in better student understanding of the problem as well as better test suites to accompany the program, both of which are desirable educational outcomes. Inquiry: Producing mutant implementations requires care. If there are too many, or they are too obscure, students will end up spending a lot of time on an unproductive task and also become frustrated. Instead, we want a small number of mutants that each correspond to common problem misconceptions. This paper presents a workflow with partial automation to produce mutants of this form which, notably, are not those produced by mutation-testing tools. Approach: We comb through student tests that fail a correct implementation. The student misconceptions are embedded in these failures. We then use methods to semantically cluster these failures. These clusters are then translated into conceptual mutants. These can then be run against student data to determine whether we they are better than prior methods. Some of these processes also enjoy automation. Knowledge: We find that student misconceptions illustrated by failing tests can be operationalized by the above process. The resulting mutants do much better at identifying student misconceptions. Grounding: Our findings are grounded in a manual analysis of student examples and a quantitative evaluation of both our clustering techniques and our process for making conceptual mutants. The clustering evaluation compares against a ground truth using standard cluster-correspondence measures, while the mutant evaluation examines how conceptual mutants perform against student data. Importance: Our work contributes a workflow, with some automation, to reduce the cost and increase the effectiveness of generating conceptually interesting mutants. Such mutants can both improve learning outcomes and reduce student frustration, leading to better educational outcomes. In the process, we also identify a variation of mutation testing not commonly discussed in the software literature. △ Less

Submitted 28 December, 2023; originally announced January 2024.

Journal ref: The Art, Science, and Engineering of Programming, 2024, Vol. 8, Issue 2, Article 7

arXiv:2211.01677 [pdf]

doi 10.22152/programming-journal.org/2023/7/7

Little Tricky Logic: Misconceptions in the Understanding of LTL

Authors: Ben Greenman, Sam Saarinen, Tim Nelson, Shriram Krishnamurthi

Abstract: Context: Linear Temporal Logic (LTL) has been used widely in verification. Its importance and popularity have only grown with the revival of temporal logic synthesis, and with new uses of LTL in robotics and planning activities. All these uses demand that the user have a clear understanding of what an LTL specification means. Inquiry: Despite the growing use of LTL, no studies have investigated… ▽ More Context: Linear Temporal Logic (LTL) has been used widely in verification. Its importance and popularity have only grown with the revival of temporal logic synthesis, and with new uses of LTL in robotics and planning activities. All these uses demand that the user have a clear understanding of what an LTL specification means. Inquiry: Despite the growing use of LTL, no studies have investigated the misconceptions users actually have in understanding LTL formulas. This paper addresses the gap with a first study of LTL misconceptions. Approach: We study researchers' and learners' understanding of LTL in four rounds (three written surveys, one talk-aloud) spread across a two-year timeframe. Concretely, we decompose "understanding LTL" into three questions. A person reading a spec needs to understand what it is saying, so we study the map** from LTL to English. A person writing a spec needs to go in the other direction, so we study English to LTL. However, misconceptions could arise from two sources: a misunderstanding of LTL's syntax or of its underlying semantics. Therefore, we also study the relationship between formulas and specific traces. Knowledge: We find several misconceptions that have consequences for learners, tool builders, and designers of new property languages. These findings are already resulting in changes to the Alloy modeling language. We also find that the English to LTL direction was the most common source of errors; unfortunately, this is the critical "authoring" direction in which a subtle mistake can lead to a faulty system. We contribute study instruments that are useful for training learners (whether academic or industrial) who are getting acquainted with LTL, and we provide a code book to assist in the analysis of responses to similar-style questions. Grounding: Our findings are grounded in the responses to our survey rounds. Round 1 used Quizius to identify misconceptions among learners in a way that reduces the threat of expert blind spots. Rounds 2 and 3 confirm that both additional learners and researchers (who work in formal methods, robotics, and related fields) make similar errors. Round 4 adds deep support for our misconceptions via talk-aloud surveys. Importance This work provides useful answers to two critical but unexplored questions: in what ways is LTL tricky and what can be done about it? Our survey instruments can serve as a starting point for other studies. △ Less

Submitted 3 November, 2022; originally announced November 2022.

Journal ref: The Art, Science, and Engineering of Programming, 2023, Vol. 7, Issue 2, Article 7

arXiv:2206.13831 [pdf]

doi 10.22152/programming-journal.org/2023/7/2

Gradual Soundness: Lessons from Static Python

Authors: Kuang-Chen Lu, Ben Greenman, Carl Meyer, Dino Viehland, Aniket Panse, Shriram Krishnamurthi

Abstract: Context: Gradually-typed languages allow typed and untyped code to interoperate, but typically come with significant drawbacks. In some languages, the types are unreliable; in others, communication across type boundaries can be extremely expensive; and still others allow only limited forms of interoperability. The research community is actively seeking a sound, fast, and expressive approach to gra… ▽ More Context: Gradually-typed languages allow typed and untyped code to interoperate, but typically come with significant drawbacks. In some languages, the types are unreliable; in others, communication across type boundaries can be extremely expensive; and still others allow only limited forms of interoperability. The research community is actively seeking a sound, fast, and expressive approach to gradual ty**. Inquiry: This paper describes Static Python, a language developed by engineers at Instagram that has proven itself sound, fast, and reasonably expressive in production. Static Python's approach to gradual types is essentially a programmer-tunable combination of the concrete and transient approaches from the literature. Concrete types provide full soundness and low performance overhead, but impose nonlocal constraints. Transient types are sound in a shallow sense and easier to use; they help to bridge the gap between untyped code and typed concrete code. Approach: We evaluate the language in its current state and develop a model that captures the essence of its approach to gradual types. We draw upon personal communication, bug reports, and the Static Python regression test suite to develop this model. Knowledge: Our main finding is that the gradual soundness that arises from a mix of concrete and transient types is an effective way to lower the maintenance cost of the concrete approach. We also find that method-based JIT technology can eliminate the costs of the transient approach. On a more technical level, this paper describes two contributions: a model of Static Python and a performance evaluation of Static Python. The process of formalization found several errors in the implementation, including fatal errors. Grounding: Our model of Static Python is implemented in PLT Redex and tested using property-based soundness tests and 265 tests from the Static Python regression suite. This paper includes a small core of the model to convey the main ideas of the Static Python approach and its soundness. Our performance claims are based on production experience in the Instagram web server. Migrations to Static Python in the server have caused a 3.7\% increase in requests handled per second at maximum CPU load. Importance: Static Python is the first sound gradual language whose piece-meal application to a realistic codebase has consistently improved performance. Other language designers may wish to replicate its approach, especially those who currently maintain unsound gradual languages and are seeking a path to soundness. △ Less

Submitted 28 June, 2022; originally announced June 2022.

Journal ref: The Art, Science, and Engineering of Programming, 2023, Vol. 7, Issue 1, Article 2

arXiv:2111.10412 [pdf]

doi 10.22152/programming-journal.org/2022/6/8

Types for Tables: A Language Design Benchmark

Authors: Kuang-Chen Lu, Ben Greenman, Shriram Krishnamurthi

Abstract: Context: Tables are ubiquitous formats for data. Therefore, techniques for writing correct programs over tables, and debugging incorrect ones, are vital. Our specific focus in this paper is on rich types that articulate the properties of tabular operations. We wish to study both their expressive power and _diagnostic quality_. Inquiry: There is no "standard library" of table operations. As a res… ▽ More Context: Tables are ubiquitous formats for data. Therefore, techniques for writing correct programs over tables, and debugging incorrect ones, are vital. Our specific focus in this paper is on rich types that articulate the properties of tabular operations. We wish to study both their expressive power and _diagnostic quality_. Inquiry: There is no "standard library" of table operations. As a result, every paper (and project) is free to use its own (sub)set of operations. This makes artifacts very difficult to compare, and it can be hard to tell whether omitted operations were left out by oversight or because they cannot actually be expressed. Furthermore, virtually no papers discuss the quality of type error feedback. Approach: We combed through several existing languages and libraries to create a "standard library" of table operations. Each entry is accompanied by a detailed specification of its "type," expressed independent of (and hence not constrained by) any type language. We also studied and categorized a corpus of (student) program edits that resulted in table-related errors. We used this to generate a suite of erroneous programs. Finally, we adapted the concept of a datasheet to facilitate comparisons of different implementations. Knowledge: Our benchmark creates a common ground to frame work in this area. Language designers who claim to support typed programming over tables have a clear suite against which to demonstrate their system's expressive power. Our family of errors also gives them a chance to demonstrate the quality of feedback. Researchers who improve one aspect -- especially error reporting -- without changing the other can demonstrate their improvement, as can those who engage in trade-offs between the two. The net result should be much better science in both expressiveness and diagnostics. We also introduce a datasheet format for presenting this knowledge in a methodical way. Grounding: We have generated our benchmark from real languages, libraries, and programs, as well as personal experience conducting and teaching data science. We have drawn on experience in engineering and, more recently, in data science to generate the datasheet. Importance: Claims about type support for tabular programming are hard to evaluate. However, tabular programming is ubiquitous, and the expressive power of type systems keeps growing. Our benchmark and datasheet can help lead to more orderly science. It also benefits programmers trying to choose a language. △ Less

Submitted 19 November, 2021; originally announced November 2021.

Journal ref: The Art, Science, and Engineering of Programming, 2022, Vol. 6, Issue 2, Article 8

arXiv:2111.10411 [pdf]

doi 10.22152/programming-journal.org/2022/6/9

A Transient Semantics for Typed Racket

Authors: Ben Greenman, Lukas Lazarek, Christos Dimoulas, Matthias Felleisen

Abstract: Mixed-typed languages enable programmers to link typed and untyped components in various ways. Some offer rich type systems to facilitate the smooth migration of untyped code to the typed world; others merely provide a convenient form of type Dynamic together with a conventional structural type system. Orthogonal to this dimension, Natural systems ensure the integrity of types with a sophisticated… ▽ More Mixed-typed languages enable programmers to link typed and untyped components in various ways. Some offer rich type systems to facilitate the smooth migration of untyped code to the typed world; others merely provide a convenient form of type Dynamic together with a conventional structural type system. Orthogonal to this dimension, Natural systems ensure the integrity of types with a sophisticated contract system, while Transient systems insert simple first-order checks at strategic places within typed code. Furthermore, each method of ensuring type integrity comes with its own blame-assignment strategy. Typed Racket has a rich migratory type system and enforces the types with a Natural semantics. Reticulated Python has a simple structural type system extended with Dynamic and enforces types with a Transient semantics. While Typed Racket satisfies the most stringent gradual-type soundness properties at a significant performance cost, Reticulated Python seems to limit the performance penalty to a tolerable degree and is nevertheless type sound. This comparison raises the question of whether Transient checking is applicable to and beneficial for a rich migratory type system. This paper reports on the surprising difficulties of adapting the Transient semantics of Reticulated Python to the rich migratory type system of Typed Racket. The resulting implementation, Shallow Typed Racket, is faster than the standard Deep Typed Racket but only when the Transient blame assignment strategy is disabled. For language designers, this report provides valuable hints on how to equip an existing compiler to support a Transient semantics. For theoreticians, the negative experience with Transient blame calls for a thorough investigation of this strategy. △ Less

Submitted 19 November, 2021; originally announced November 2021.

Journal ref: The Art, Science, and Engineering of Programming, 2022, Vol. 6, Issue 2, Article 9

Showing 1–7 of 7 results for author: Greenman, B