-
EasyView: Bringing Performance Profiles into Integrated Development Environments
Authors:
Qidong Zhao,
Milind Chabbi,
Xu Liu
Abstract:
Dynamic program analysis (also known as profiling) is well-known for its powerful capabilities of identifying performance inefficiencies in software packages. Although a large number of dynamic program analysis techniques are developed in academia and industry, very few of them are widely used by software developers in their regular software develo** activities. There are three major reasons. Fi…
▽ More
Dynamic program analysis (also known as profiling) is well-known for its powerful capabilities of identifying performance inefficiencies in software packages. Although a large number of dynamic program analysis techniques are developed in academia and industry, very few of them are widely used by software developers in their regular software develo** activities. There are three major reasons. First, the dynamic analysis tools (also known as profilers) are disjoint from the coding environments such as IDEs and editors; frequently switching focus between them significantly complicates the entire cycle of software development. Second, mastering various tools to interpret their analysis results requires substantial efforts; even worse, many tools have their own design of graphical user interfaces (GUI) for data presentation, which steepens the learning curves. Third, most existing tools expose few interfaces to support user-defined analysis, which makes the tools less customizable to fulfill diverse user demands. We develop EasyView, a general solution to integrate the interpretation and visualization of various profiling results in the coding environments, which bridges software developers with profilers to provide easy and intuitive dynamic analysis during the code development cycle. The novelty of EasyView is three-fold. First, we develop a generic data format, which enables EasyView to support mainstream profilers for different languages. Second, we develop a set of customizable schemes to analyze and visualize the profiles in intuitive ways. Third, we tightly integrate EasyView with popular coding environments, such as Microsoft Visual Studio Code, with easy code exploration and user interaction. Our evaluation shows that EasyView is able to support various profilers for different languages and provide unique insights into performance inefficiencies in different domains.
△ Less
Submitted 27 December, 2023;
originally announced December 2023.
-
Unveiling and Vanquishing Goroutine Leaks in Enterprise Microservices: A Dynamic Analysis Approach
Authors:
Georgian-Vlad Saioc,
Dmitriy Shirchenko,
Milind Chabbi
Abstract:
Go is a modern programming language gaining popularity in enterprise microservice systems. Concurrency is a first-class citizen in Go with lightweight ``goroutines'' as the building blocks of concurrent execution. Go advocates message-passing to communicate and synchronize among goroutines. Improper use of message passing in Go can result in ``partial deadlocks'' , a subtle concurrency bug where a…
▽ More
Go is a modern programming language gaining popularity in enterprise microservice systems. Concurrency is a first-class citizen in Go with lightweight ``goroutines'' as the building blocks of concurrent execution. Go advocates message-passing to communicate and synchronize among goroutines. Improper use of message passing in Go can result in ``partial deadlocks'' , a subtle concurrency bug where a blocked sender (receiver) never finds a corresponding receiver (sender), causing the blocked goroutine to leak memory, via its call stack and objects reachable from the stack.
In this paper, we systematically study the prevalence of message passing and the resulting partial deadlocks in 75 million lines of Uber's Go monorepo hosting over 2500 microservices. We develop two lightweight, dynamic analysis tools: Goleak and LeakProf, designed to identify partial deadlocks. Goleak detects partial deadlocks during unit testing and prevents the introduction of new bugs. Conversely, LeakProf uses goroutine profiles obtained from services deployed in production to pinpoint intricate bugs arising from complex control flow, unexplored interleavings, or the absence of test coverage. We share our experience and insights deploying these tools in developer workflows in a large industrial setting. Using Goleak we unearthed 857 pre-existing goroutine leaks in the legacy code and prevented the introduction of around 260 new leaks over one year period. Using LeakProf we found 24 and fixed 21 goroutine leaks, which resulted in up to 34% speedup and 9.2x memory reduction in some of our production services.
△ Less
Submitted 19 December, 2023;
originally announced December 2023.
-
Protecting Locks Against Unbalanced Unlock()
Authors:
Vivek Shahare,
Milind Chabbi,
Nikhil Hegde
Abstract:
The lock is a building-block synchronization primitive that enables mutually exclusive access to shared data in shared-memory parallel programs. Mutual exclusion is typically achieved by guarding the code that accesses the shared data with a pair of lock() and unlock() operations. Concurrency bugs arise when this ordering of operations is violated. In this paper, we study a particular pattern of m…
▽ More
The lock is a building-block synchronization primitive that enables mutually exclusive access to shared data in shared-memory parallel programs. Mutual exclusion is typically achieved by guarding the code that accesses the shared data with a pair of lock() and unlock() operations. Concurrency bugs arise when this ordering of operations is violated. In this paper, we study a particular pattern of misuse where an unlock() is issued without first issuing a lock(), which can happen in code with complex control flow. This misuse is surprisingly common in several important open-source repositories we study. We systematically study what happens due to this misuse in several popular locking algorithms. We study how misuse can be detected and how the locking protocols can be fixed to avoid the unwanted consequences of misuse. Most locks require simple changes to detect and prevent this misuse. We evaluate the performance traits of modified implementations, which show mild performance penalties in most scalable locks.
△ Less
Submitted 24 April, 2023;
originally announced April 2023.
-
A Study of Real-World Data Races in Golang
Authors:
Milind Chabbi,
Murali Krishna Ramanathan
Abstract:
The concurrent programming literature is rich with tools and techniques for data race detection. Less, however, has been known about real-world, industry-scale deployment, experience, and insights about data races. Golang (Go for short) is a modern programming language that makes concurrency a first-class citizen. Go offers both message passing and shared memory for communicating among concurrent…
▽ More
The concurrent programming literature is rich with tools and techniques for data race detection. Less, however, has been known about real-world, industry-scale deployment, experience, and insights about data races. Golang (Go for short) is a modern programming language that makes concurrency a first-class citizen. Go offers both message passing and shared memory for communicating among concurrent threads. Go is gaining popularity in modern microservice-based systems. Data races in Go stand in the face of its emerging popularity.
In this paper, using our industrial codebase as an example, we demonstrate that Go developers embrace concurrency and show how the abundance of concurrency alongside language idioms and nuances make Go programs highly susceptible to data races. Google's Go distribution ships with a built-in dynamic data race detector based on ThreadSanitizer. However, dynamic race detectors pose scalability and flakiness challenges; we discuss various software engineering trade-offs to make this detector work effectively at scale. We have deployed this detector in Uber's 46 million lines of Go codebase hosting 2100 distinct microservices, found over 2000 data races, and fixed over 1000 data races, spanning 790 distinct code patches submitted by 210 unique developers over a six-month period. Based on a detailed investigation of these data race patterns in Go, we make seven high-level observations relating to the complex interplay between the Go language paradigm and data races.
△ Less
Submitted 5 April, 2022; v1 submitted 2 April, 2022;
originally announced April 2022.
-
OJXPerf: Featherlight Object Replica Detection for Java Programs
Authors:
Bolun Li,
Hao Xu,
Qidong Zhao,
Pengfei Su,
Milind Chabbi,
Shuyin Jiao,
Xu Liu
Abstract:
Memory bloat is an important source of inefficiency in complex production software, especially in software written in managed languages such as Java. Prior approaches to this problem have focused on identifying objects that outlive their life span. Few studies have, however, looked into whether and to what extent myriad objects of the same type are identical. A quantitative assessment of identical…
▽ More
Memory bloat is an important source of inefficiency in complex production software, especially in software written in managed languages such as Java. Prior approaches to this problem have focused on identifying objects that outlive their life span. Few studies have, however, looked into whether and to what extent myriad objects of the same type are identical. A quantitative assessment of identical objects with code-level attribution can assist developers in refactoring code to eliminate object bloat, and favor reuse of existing object(s). The result is reduced memory pressure, reduced allocation and garbage collection, enhanced data locality, and reduced re-computation, all of which result in superior performance.
We develop OJXPerf, a lightweight sampling-based profiler, which probabilistically identifies identical objects. OJXPerf employs hardware performance monitoring units (PMU) in conjunction with hardware debug registers to sample and compare field values of different objects of the same type allocated at the same calling context but potentially accessed at different program points. The result is a lightweight measurement, a combination of object allocation contexts and usage contexts ordered by duplication frequency. This class of duplicated objects is relatively easier to optimize. OJXPerf incurs 9% runtime and 6% memory overheads on average. We empirically show the benefit of OJXPerf by using its profiles to instruct us to optimize a number of Java programs, including well-known benchmarks and real-world applications. The results show a noticeable reduction in memory usage (up to 11%) and a significant speedup (up to 25%).
△ Less
Submitted 23 March, 2022;
originally announced March 2022.
-
Optimistic Concurrency Control for Real-world Go Programs (Extended Version with Appendix)
Authors:
Zhizhou Zhang,
Milind Chabbi,
Adam Welc,
Timothy Sherwood
Abstract:
We present a source-to-source transformation framework, GOCC, that consumes lock-based pessimistic concurrency programs in the Go language and transforms them into optimistic concurrency programs that use Hardware Transactional Memory (HTM). The choice of the Go language is motivated by the fact that concurrency is a first-class citizen in Go, and it is widely used in Go programs. GOCC performs ri…
▽ More
We present a source-to-source transformation framework, GOCC, that consumes lock-based pessimistic concurrency programs in the Go language and transforms them into optimistic concurrency programs that use Hardware Transactional Memory (HTM). The choice of the Go language is motivated by the fact that concurrency is a first-class citizen in Go, and it is widely used in Go programs. GOCC performs rich inter-procedural program analysis to detect and filter lock-protected regions and performs AST-level code transformation of the surrounding locks when profitable. Profitability is driven by both static analyses of critical sections and dynamic analysis via execution profiles. A custom HTM library, using perceptron, learns concurrency behavior and dynamically decides whether to use HTM in the rewritten lock/unlock points. Given the rich history of transactional memory research but its lack of adoption in any industrial setting, we believe this workflow, which ultimately produces source-code patches, is more apt for industry-scale adoption. Results on widely adopted Go libraries and applications demonstrate significant (up to 10x) and scalable performance gains resulting from our automated transformation while avoiding major performance regressions.
△ Less
Submitted 3 June, 2021;
originally announced June 2021.
-
DJXPerf: Identifying Memory Inefficiencies via Object-centric Profiling for Java
Authors:
Bolun Li,
Pengfei Su,
Milind Chabbi,
Shuyin Jiao,
Xu Liu
Abstract:
Java is the "go-to" programming language choice for develo** scalable enterprise cloud applications. In such systems, even a few percent CPU time savings can offer a significant competitive advantage and cost saving. Although performance tools abound in Java, those that focus on the data locality in the memory hierarchy are rare.
In this paper, we present DJXPerf, a lightweight, object-centric…
▽ More
Java is the "go-to" programming language choice for develo** scalable enterprise cloud applications. In such systems, even a few percent CPU time savings can offer a significant competitive advantage and cost saving. Although performance tools abound in Java, those that focus on the data locality in the memory hierarchy are rare.
In this paper, we present DJXPerf, a lightweight, object-centric memory profiler for Java, which associates memory-hierarchy performance metrics (e.g., cache/TLB misses) with Java objects. DJXPerf uses statistical sampling of hardware performance monitoring counters to attribute metrics to not only source code locations but also Java objects. DJXPerf presents Java object allocation contexts combined with their usage contexts and presents them ordered by the poor locality behaviors. DJXPerf's performance measurement, object attribution, and presentation techniques guide optimizing object allocation, layout, and access patterns. DJXPerf incurs only ~8% runtime overhead and ~5% memory overhead on average, requiring no modifications to hardware, OS, Java virtual machine, or application source code, which makes it attractive to use in production. Guided by DJXPerf, we study and optimize a number of Java and Scala programs, including well-known benchmarks and real-world applications, and demonstrate significant speedups.
△ Less
Submitted 7 April, 2021;
originally announced April 2021.
-
Pinpointing Performance Inefficiencies in Java
Authors:
Pengfei Su,
Qingsen Wang,
Milind Chabbi,
Xu Liu
Abstract:
Many performance inefficiencies such as inappropriate choice of algorithms or data structures, developers' inattention to performance, and missed compiler optimizations show up as wasteful memory operations. Wasteful memory operations are those that produce/consume data to/from memory that may have been avoided. We present, JXPerf, a lightweight performance analysis tool for pinpointing wasteful m…
▽ More
Many performance inefficiencies such as inappropriate choice of algorithms or data structures, developers' inattention to performance, and missed compiler optimizations show up as wasteful memory operations. Wasteful memory operations are those that produce/consume data to/from memory that may have been avoided. We present, JXPerf, a lightweight performance analysis tool for pinpointing wasteful memory operations in Java programs. Traditional byte-code instrumentation for such analysis (1) introduces prohibitive overheads and (2) misses inefficiencies in machine code generation. JXPerf overcomes both of these problems. JXPerf uses hardware performance monitoring units to sample memory locations accessed by a program and uses hardware debug registers to monitor subsequent accesses to the same memory. The result is a lightweight measurement at machine-code level with attribution of inefficiencies to their provenance: machine and source code within full calling contexts. JXPerf introduces only 7% runtime overhead and 7% memory overhead making it useful in production. Guided by JXPerf, we optimize several Java applications by improving code generation and choosing superior data structures and algorithms, which yield significant speedups.
△ Less
Submitted 28 June, 2019;
originally announced June 2019.
-
Redundant Loads: A Software Inefficiency Indicator
Authors:
Pengfei Su,
Shasha Wen,
Hailong Yang,
Milind Chabbi,
Xu Liu
Abstract:
Modern software packages have become increasingly complex with millions of lines of code and references to many external libraries. Redundant operations are a common performance limiter in these code bases. Missed compiler optimization opportunities, inappropriate data structure and algorithm choices, and developers' inattention to performance are some common reasons for the existence of redundant…
▽ More
Modern software packages have become increasingly complex with millions of lines of code and references to many external libraries. Redundant operations are a common performance limiter in these code bases. Missed compiler optimization opportunities, inappropriate data structure and algorithm choices, and developers' inattention to performance are some common reasons for the existence of redundant operations. Developers mainly depend on compilers to eliminate redundant operations. However, compilers' static analysis often misses optimization opportunities due to ambiguities and limited analysis scope; automatic optimizations to algorithmic and data structural problems are out of scope.
We develop LoadSpy, a whole-program profiler to pinpoint redundant memory load operations, which are often a symptom of many redundant operations. The strength of LoadSpy exists in identifying and quantifying redundant load operations in programs and associating the redundancies with program execution contexts and scopes to focus developers' attention on problematic code. LoadSpy works on fully optimized binaries, adopts various optimization techniques to reduce its overhead, and provides a rich graphic user interface, which make it a complete developer tool. Applying LoadSpy showed that a large fraction of redundant loads is common in modern software packages despite highest levels of automatic compiler optimizations. Guided by LoadSpy, we optimize several well-known benchmarks and real-world applications, yielding significant speedups.
△ Less
Submitted 14 February, 2019;
originally announced February 2019.
-
Language Modeling at Scale
Authors:
Mostofa Patwary,
Milind Chabbi,
Heewoo Jun,
Jiaji Huang,
Gregory Diamos,
Kenneth Church
Abstract:
We show how Zipf's Law can be used to scale up language modeling (LM) to take advantage of more training data and more GPUs. LM plays a key role in many important natural language applications such as speech recognition and machine translation. Scaling up LM is important since it is widely accepted by the community that there is no data like more data. Eventually, we would like to train on terabyt…
▽ More
We show how Zipf's Law can be used to scale up language modeling (LM) to take advantage of more training data and more GPUs. LM plays a key role in many important natural language applications such as speech recognition and machine translation. Scaling up LM is important since it is widely accepted by the community that there is no data like more data. Eventually, we would like to train on terabytes (TBs) of text (trillions of words). Modern training methods are far from this goal, because of various bottlenecks, especially memory (within GPUs) and communication (across GPUs). This paper shows how Zipf's Law can address these bottlenecks by grou** parameters for common words and character sequences, because $U \ll N$, where $U$ is the number of unique words (types) and $N$ is the size of the training set (tokens). For a local batch size $K$ with $G$ GPUs and a $D$-dimension embedding matrix, we reduce the original per-GPU memory and communication asymptotic complexity from $Θ(GKD)$ to $Θ(GK + UD)$. Empirically, we find $U \propto (GK)^{0.64}$ on four publicly available large datasets. When we scale up the number of GPUs to 64, a factor of 8, training time speeds up by factors up to 6.7$\times$ (for character LMs) and 6.3$\times$ (for word LMs) with negligible loss of accuracy. Our weak scaling on 192 GPUs on the Tieba dataset shows a 35\% improvement in LM prediction accuracy by training on 93 GB of data (2.5$\times$ larger than publicly available SOTA dataset), but taking only 1.25$\times$ increase in training time, compared to 3 GB of the same dataset running on 6 GPUs.
△ Less
Submitted 23 October, 2018;
originally announced October 2018.
-
Correctness of Hierarchical MCS Locks with Timeout
Authors:
Milind Chabbi,
Abdelhalim Amer,
Shasha Wen,
Xu Liu
Abstract:
This manuscript serves as a correctness proof of the Hierarchical MCS locks with Timeout (HMCS-T) described in our paper titled "An Efficient Abortable-locking Protocol for Multi-level NUMA Systems" appearing in the proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
HMCS-T is a very involved protocol. The system is stateful; the values of prior acqu…
▽ More
This manuscript serves as a correctness proof of the Hierarchical MCS locks with Timeout (HMCS-T) described in our paper titled "An Efficient Abortable-locking Protocol for Multi-level NUMA Systems" appearing in the proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
HMCS-T is a very involved protocol. The system is stateful; the values of prior acquisition efforts affect the subsequent acquisition efforts. Also, the status of successors, predecessors, ancestors, and descendants affect steps followed by the protocol. The ability to make the protocol fully non-blocking leads to modifications to the \texttt{next} field, which causes deviation from the original MCS lock protocol both in acquisition and release. At several places, unconditional field updates are replaced with SWAP or CAS operations.
We follow a multi-step approach to prove the correctness of HMCS-T. To demonstrate the correctness of the HMCS-T lock, we use the Spin model checking. Model checking causes a combinatorial explosion even to simulate a handful of threads. First, we understand the minimal, sufficient configurations necessary to prove safety properties of a single level of lock in the tree. We construct HMCS-T locks that represent these configurations. We model check these configurations, which proves the correctness of components of an HMCS-T lock. Finally, building upon these facts, we argue logically for the correctness of HMCS-T<n>.
△ Less
Submitted 30 December, 2016;
originally announced December 2016.