
doi 10.1109/IWSC60764.2023.00010

Using Ensemble Inference to Improve Recall of Clone Detection

Authors: Gul Aftab Ahmed, James Vincent Patten, Yuanhua Han, Guoxian Lu, David Gregg, Jim Buckley, Muslim Chochlov

Abstract: Large-scale source-code clone detection is a challenging task. In our previous work, we proposed an approach (SSCD) that leverages artificial neural networks and approximate nearest-neighbour search to locate clones in large-scale bodies of code effectively and time-efficiently. However, our literature review suggests that the relative efficacy of differing neural network models has not been assessed in the context of large-scale clone detection approaches. In this work, we aim to assess several such models individually, in terms of their potential to maximize recall while preserving a high level of precision during clone detection. We investigate whether ensemble inference (in this case, using the results of more than one of these neural network models in combination) can further assist in this task. To assess this, we employed four state-of-the-art neural network models and evaluated them individually and in combination. The results, on an illustrative dataset of approximately 500K lines of C/C++ code, suggest that ensemble inference outperforms individual models in all trialled cases where recall is concerned. Of the individual models, the ADA model (belonging to the ChatGPT family of models) has the best performance. However, commercial companies may not be prepared to hand their proprietary source code over to the cloud, as required by that approach. Consequently, they may be more interested in an ensemble combination of CodeBERT-based and CodeT5 models, which yields similar (if slightly lower) recall and precision.

Submitted 12 February, 2024; originally announced February 2024.

Journal ref: 2023 IEEE 17th International Workshop on Software Clones (IWSC)
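
The ensemble-inference idea in the abstract above can be illustrated with a minimal sketch. It assumes each model returns a set of candidate clone pairs and that the ensemble simply takes their union; the pair identifiers, the ground-truth set, and the union rule are all hypothetical, and the paper's actual combination rule may differ.

```python
def recall(found, truth):
    """Fraction of true clone pairs that were found."""
    return len(found & truth) / len(truth) if truth else 0.0


def precision(found, truth):
    """Fraction of reported pairs that are true clone pairs."""
    return len(found & truth) / len(found) if found else 0.0


def ensemble_union(*model_outputs):
    """Combine per-model candidate sets by union (an assumed rule)."""
    combined = set()
    for pairs in model_outputs:
        combined |= pairs
    return combined


# Hypothetical ground truth and per-model outputs (not data from the paper).
truth = {("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")}
model_1 = {("a", "b"), ("c", "d"), ("x", "y")}  # one false positive
model_2 = {("c", "d"), ("e", "f")}

combined = ensemble_union(model_1, model_2)

for name, found in [("model_1", model_1), ("model_2", model_2), ("ensemble", combined)]:
    print(name, "recall:", recall(found, truth), "precision:", precision(found, truth))
```

A union can never lower recall relative to any single member, but it inherits every member's false positives, which is why precision has to be tracked alongside it.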

arXiv:2309.02182 [pdf, other]

doi 10.1109/ICSME55016.2022.00080

Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection

Authors: Muslim Chochlov, Gul Aftab Ahmed, James Vincent Patten, Guoxian Lu, Wei Hou, David Gregg, Jim Buckley

Abstract: Code clones can detrimentally impact software maintenance, and manually detecting them in very large codebases is impractical. Additionally, automated approaches find detection of Type 3 and Type 4 (inexact) clones very challenging. While the most recent artificial deep neural networks (for example, BERT-based artificial neural networks) seem to be highly effective in detecting such clones, their pairwise comparison of every code pair in the target system(s) is inefficient and scales poorly on large codebases. We therefore introduce SSCD, a BERT-based clone detection approach that targets high recall of Type 3 and Type 4 clones at scale (in line with our industrial partner's requirements). It does so by computing a representative embedding for each code fragment and finding similar fragments using a nearest-neighbour search. SSCD thus avoids the pairwise-comparison bottleneck of other neural network approaches while also using parallel, GPU-accelerated search to tackle scalability. This paper details the approach and an empirical assessment towards configuring and evaluating that approach in an industrial setting. The configuration analysis suggests that shorter input lengths and text-only neural network models demonstrate better efficiency in SSCD, while only slightly decreasing effectiveness. The evaluation results suggest that SSCD is more effective than state-of-the-art approaches like SAGA and SourcererCC. It is also highly efficient: in its optimal setting, SSCD effectively locates clones in the entire 320 million LOC BigCloneBench (a standard clone detection benchmark) in just under three hours.

Submitted 5 September, 2023; originally announced September 2023.

Comments: 10 pages, 2 figures, 38th IEEE International Conference on Software Maintenance and Evolution
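
The embed-once, search-by-nearest-neighbour structure described in the abstract above can be sketched as follows. This is a toy under stated assumptions: `toy_embed` is a hypothetical bag-of-tokens stand-in for the BERT-based encoder, the search is an exact brute-force scan rather than the GPU-accelerated search the paper mentions, and the code fragments are invented.

```python
import heapq
import math


def build_vocab(fragments):
    """Assign each distinct whitespace-separated token an index."""
    vocab = {}
    for code in fragments:
        for token in code.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def toy_embed(code, vocab):
    """Hypothetical stand-in for a neural encoder: L2-normalised token counts."""
    vec = [0.0] * len(vocab)
    for token in code.split():
        vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def nearest(query_idx, embeddings, k=1):
    """Return the k most similar fragments as (cosine similarity, index)."""
    q = embeddings[query_idx]
    scored = (
        (sum(a * b for a, b in zip(q, e)), i)
        for i, e in enumerate(embeddings)
        if i != query_idx
    )
    return heapq.nlargest(k, scored)


fragments = [
    "int add ( int a , int b ) { return a + b ; }",
    "int sum ( int x , int y ) { return x + y ; }",
    "void log ( const char * msg ) { puts ( msg ) ; }",
]
vocab = build_vocab(fragments)
# The (expensive) encoder runs once per fragment, not once per pair.
embeddings = [toy_embed(f, vocab) for f in fragments]
best_score, best_idx = nearest(0, embeddings, k=1)[0]
print("closest to fragment 0:", best_idx, "similarity:", round(best_score, 3))
```

The design point is that the costly model call happens n times instead of once for each of the n(n-1)/2 pairs; the remaining comparison work is cheap vector arithmetic that an index or GPU search can accelerate.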

arXiv:2303.15199 [pdf, other]

Maple: A Processing Element for Row-Wise Product Based Sparse Tensor Accelerators

Authors: Midia Reshadi, David Gregg

Abstract: Sparse tensor computing is a core computational part of numerous applications in areas such as data science, graph processing, and scientific computing. Sparse tensors offer the potential of skipping unnecessary computations caused by zero values. In this paper, we propose a new strategy for extending row-wise product sparse tensor accelerators. We propose a new processing element called Maple that uses multiple multiply-accumulate (MAC) units to exploit local clusters of non-zero values to increase parallelism and reduce data movement. Maple works on the compressed sparse row (CSR) format and calculates only non-zero elements of the input matrices based on the sparsity pattern. Furthermore, Maple may be employed as a basic building block in a variety of spatial tensor accelerators that operate on a row-wise product approach. As a proof of concept, we utilize Maple in two reference accelerators: Extensor and Matraptor. Our experiments show that using Maple in Matraptor and Extensor achieves 50% and 60% energy benefit and 15% and 22% speedup over the baseline designs, respectively. Employing Maple also results in 5.9x and 15.5x smaller area consumption in Matraptor and Extensor compared with the baseline structures, respectively.

Submitted 27 March, 2023; originally announced March 2023.
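
For readers unfamiliar with the row-wise product dataflow on CSR that the abstract above builds on, here is a minimal software sketch (Gustavson-style sparse matrix multiplication). It only illustrates the dataflow; it does not model Maple's MAC units or any hardware detail, and the helper names and example matrices are hypothetical.

```python
def dense_to_csr(matrix):
    """Convert a dense list-of-lists to CSR: (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spgemm_row_wise(a, b):
    """C = A @ B with both operands in CSR, using the row-wise product.

    For each non-zero A[i, k], row k of B is scaled and accumulated
    into row i of C, so zero entries are never touched.
    """
    a_val, a_col, a_ptr = a
    b_val, b_col, b_ptr = b
    c_val, c_col, c_ptr = [], [], [0]
    for i in range(len(a_ptr) - 1):
        acc = {}  # column index -> partial sum for output row i
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k = a_col[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_col[q]
                acc[j] = acc.get(j, 0) + a_val[p] * b_val[q]
        for j in sorted(acc):
            c_col.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_val))
    return c_val, c_col, c_ptr


A = [[1, 0, 2],
     [0, 0, 0],
     [0, 3, 0]]
B = [[0, 4, 0],
     [5, 0, 0],
     [0, 0, 6]]
C = spgemm_row_wise(dense_to_csr(A), dense_to_csr(B))
print(C)
```

Each output row is produced independently from one row of A and the rows of B it references, which is what makes the approach attractive for spatial accelerators.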

arXiv:2302.10806 [pdf, other]

Dynamic Resource Partitioning for Multi-Tenant Systolic Array Based DNN Accelerator

Authors: Midia Reshadi, David Gregg

Abstract: Deep neural networks (DNNs) have become significant applications in both cloud-server and edge devices. Meanwhile, the growing number of DNNs on those platforms raises the need to execute multiple DNNs on the same device. This paper proposes a dynamic partitioning algorithm to perform concurrent processing of multiple DNNs on a systolic-array-based accelerator. Sharing an accelerator's storage and processing resources across multiple DNNs increases resource utilization and reduces computation time and energy consumption. To this end, we propose a partitioned weight-stationary dataflow with a minor modification in the logic of the processing element. We evaluate the energy consumption and computation time with both heavy and light workloads. Simulation results show a 35% and 62% improvement in energy consumption and 56% and 44% in computation time under heavy and light workloads, respectively, compared with single tenancy.

Submitted 21 February, 2023; originally announced February 2023.

arXiv:2301.05335 [pdf, other]

doi 10.3847/1538-4365/accea7

HST Low Resolution Stellar Library

Authors: Tathagata Pal, Islam Khan, Guy Worthey, Michael D. Gregg, David R. Silva

Abstract: Hubble Space Telescope's (HST) Space Telescope Imaging Spectrograph (STIS) targeted 556 stars in a long-running program called Next Generation Spectral Library (NGSL) via proposals GO9088, GO9786, GO10222, and GO13776. Exposures through three low-resolution gratings provide wavelength coverage from 0.2 $< λ <$ 1 $μ$m at $λ/Δλ \sim 1000$, providing unique coverage in the ultraviolet (UV). The UV grating (G230LB) scatters red light, and this results in unwanted flux that becomes especially troubling for cool stars. We applied scattered light corrections based on \cite{2022stis.rept....5W} and flux corrections arising from pointing errors relative to the center of the 0.2-arcsecond slit. We present 514 fully reduced spectra, fluxed, dereddened, and cross-correlated to zero velocity. Because of the broad spectral range, we can simultaneously study H$α$ and Mg II $λ$2800, indicators of chromospheric activity. Their behaviours are decoupled. Besides three cool dwarfs and one giant with mild flares in H$α$, only Be stars show strong H$α$ emission. Mg2800 emission, however, strongly anti-correlates with temperature such that warm stars show absorption and stars cooler than 5000 K universally show chromospheric emission regardless of dwarf/giant status or metallicity. Transformed to Mg2800 flux emerging from the stellar surface, we find a correlation with temperature with approximately symmetric astrophysical scatter, in contrast to other workers who find a basal level with asymmetric scatter to strong values. Unsurprisingly, we confirm that Mg2800 activity is variable.

Submitted 18 April, 2023; v1 submitted 12 January, 2023; originally announced January 2023.

Comments: 19 pages, 20 figures, 3 tables. Full version of table 3 is available online at https://archive.stsci.edu/hlsp/lowlib

arXiv:2212.05482 [pdf, other]

doi 10.1051/0004-6361/202243154

The globular cluster system of the nearest Seyfert II galaxy Circinus

Authors: C. Obasi, M. Gómez, D. Minniti, J. Alonso-García, M. Hempel, J. B. Pullen, M. D. Gregg, L. D. Baravalle, M. V. Alonso, B. I. Okere

Abstract: Context. The globular cluster (GC) system of the Circinus galaxy has not been probed previously, partly because of the galaxy's location at $-3.8^\circ$ Galactic latitude, which suffers severely from interstellar extinction, stellar crowding, and Galactic foreground contamination. However, the deep near-infrared (NIR) photometry by the VISTA Variables in the Via Láctea Extended Survey (VVVX) in combination with the precise astrometry of Gaia EDR3 allows us to map GCs in this region. Aims. Our long-term goal is to study and characterise the distributions of GCs and ultra-compact dwarfs of the Circinus galaxy, which is the nearest Seyfert II galaxy. Here we conduct the first pilot search for GCs in this galaxy. Methods. We use NIR VVVX photometry in combination with Gaia EDR3 astrometric features such as astrometric excess noise and BP/RP excess factor to build the first homogeneous catalogue of GCs in the Circinus galaxy. A robust combination of selection criteria allows us to effectively clean interlopers from our sample. Results. We report the detection of $\sim$70 GC candidates in this galaxy at a 3$σ$ confidence level. They show a bimodal colour distribution with the blue peak at (G-Ks)$_0$ = 0.985$\pm$0.127 mag with a dispersion of 0.211$\pm$0.091 mag and the red peak at (G-Ks)$_0$ = 1.625$\pm$0.177 mag with a dispersion of 0.482$\pm$0.114 mag. A GC specific frequency (S$_N$) of 1.3$\pm$0.2 was derived for the galaxy, and we estimated a total population of 120$\pm$40 GCs. Based on the projected radial distribution, it appears that Circinus has a different distribution of GC candidates than the MW and M31. Conclusions. We demonstrate that the Circinus galaxy hosts a sizeable number of cluster candidates. This result is the first leap towards understanding the evolution of old stellar clusters in this galaxy.

Submitted 11 December, 2022; originally announced December 2022.

Comments: 15 pages, 12 figures

Journal ref: A&A 670, A18 (2023)

arXiv:2211.03672 [pdf, other]

doi 10.1109/NorCAS53631.2021.9599862

LOCAL: Low-Complex Mapping Algorithm for Spatial DNN Accelerators

Authors: Midia Reshadi, David Gregg

Abstract: Deep neural networks are a promising solution for applications that solve problems based on learning data sets. DNN accelerators solve the processing bottleneck as a domain-specific processor. Like other hardware solutions, there must be exact compatibility between the accelerator and other software components, especially the compiler. This paper presents LOCAL (Low Complexity mapping Algorithm), which is well suited for use at the compiler level to perform mapping operations in one pass with low computation time and energy consumption. We first introduce a formal definition of the design space in order to define the problem's scope, and then we describe the concept of the LOCAL algorithm. The simulation results show 2x to 38x improvements in execution time with lower energy consumption compared to previously proposed dataflow mechanisms.

Submitted 7 November, 2022; originally announced November 2022.

arXiv:2210.03777 [pdf, other]

Optimal Energy Shaping Control for a Backdrivable Hip Exoskeleton

Authors: Jiefu Zhang, Jianping Lin, Vamsi Peddinti, Robert D. Gregg

Abstract: Task-dependent controllers widely used in exoskeletons track predefined trajectories, which overly constrain the volitional motion of individuals with remnant voluntary mobility. Energy shaping, on the other hand, provides task-invariant assistance by altering the human body's dynamic characteristics in the closed loop. While human-exoskeleton systems are often modeled using Euler-Lagrange equations, in our previous work we modeled the system as a port-controlled Hamiltonian system, and a task-invariant controller was designed for a knee-ankle exoskeleton using interconnection-damping assignment passivity-based control. In this paper, we extend this framework to design a controller for a backdrivable hip exoskeleton to assist multiple tasks. A set of basis functions that contains kinematic information is selected and the corresponding coefficients are optimized, which allows the controller to provide torque that fits normative human torque for different activities of daily life. Human-subject experiments with two able-bodied subjects demonstrated the controller's capability to reduce muscle effort across different tasks.

Submitted 25 March, 2023; v1 submitted 7 October, 2022; originally announced October 2022.

arXiv:2209.12365 [pdf, other]

Deep Convolutional Neural Network and Transfer Learning for Locomotion Intent Prediction

Authors: Duong Le, Shihao Cheng, Robert D. Gregg, Maani Ghaffari

Abstract: Powered prosthetic legs must anticipate the user's intent when switching between different locomotion modes (e.g., level walking, stair ascent/descent, ramp ascent/descent). Numerous data-driven classification techniques have demonstrated promising results for predicting user intent, but the performance of these intent prediction models on novel subjects remains undesirable. In other domains (e.g., image classification), transfer learning has improved classification accuracy by using previously learned features from a large dataset (i.e., pre-trained models) and then transferring this learned model to a new task where a smaller dataset is available. In this paper, we develop a deep convolutional neural network with intra-subject (subject-dependent) and inter-subject (subject-independent) validations based on a human locomotion dataset. We then apply transfer learning for the subject-independent model using a small portion (10%) of the data from the left-out subject. We compare the performance of these three models. Our results indicate that the transfer learning (TL) model outperforms the subject-independent (IND) model and is comparable to the subject-dependent (DEP) model (DEP Error: 0.74 $\pm$ 0.002%, IND Error: 11.59 $\pm$ 0.076%, TL Error: 3.57 $\pm$ 0.02% with 10% data). Moreover, as expected, transfer learning accuracy increases with the availability of more data from the left-out subject. We also evaluate the performance of the intent prediction system in various sensor configurations that may be available in a prosthetic leg application. Our results suggest that a thigh IMU on the prosthesis is sufficient to predict locomotion intent in practice.

Submitted 25 September, 2022; originally announced September 2022.

arXiv:2205.02131 [pdf, other]

Domino Saliency Metrics: Improving Existing Channel Saliency Metrics with Structural Information

Authors: Kaveena Persand, Andrew Anderson, David Gregg

Abstract: Channel pruning is used to reduce the number of weights in a Convolutional Neural Network (CNN). Channel pruning removes slices of the weight tensor so that the convolution layer remains dense. The removal of these weight slices from a single layer causes a mismatch in the number of feature maps between layers of the network. A simple solution is to force the number of feature maps between layers to match through the removal of weight slices from subsequent layers. This additional constraint becomes more apparent in DNNs with branches, where multiple channels need to be pruned together to keep the network dense. Popular pruning saliency metrics do not factor in the structural dependencies that arise in DNNs with branches. We propose Domino metrics (built on existing channel saliency metrics) to reflect these structural constraints. We test Domino saliency metrics against the baseline channel saliency metrics on multiple networks with branches. Domino saliency metrics improved pruning rates in most tested networks, by up to 25% in AlexNet on CIFAR-10.

Submitted 19 June, 2022; v1 submitted 4 May, 2022; originally announced May 2022.

arXiv:2205.00155 [pdf, ps, other]

doi 10.48550/arXiv.2205.00155

Real-Time Gait Phase and Task Estimation for Controlling a Powered Ankle Exoskeleton on Extremely Uneven Terrain

Authors: Roberto Leo Medrano, Gray Cortright Thomas, Connor G. Keais, Elliott J. Rouse, Robert D. Gregg

Abstract: Positive biomechanical outcomes have been reported with lower-limb exoskeletons in laboratory settings, but these devices have difficulty delivering appropriate assistance in synchrony with human gait as the task or rate of phase progression changes in real-world environments. This paper presents a controller for an ankle exoskeleton that uses a data-driven kinematic model to continuously estimate the phase, phase rate, stride length, and ground incline states during locomotion, which enables the real-time adaptation of torque assistance to match human torques observed in a multi-activity database of 10 able-bodied subjects. We demonstrate in live experiments with a new cohort of 10 able-bodied participants that the controller yields phase estimates comparable to the state of the art, while also estimating task variables with similar accuracy to recent machine learning approaches. The implemented controller successfully adapts its assistance in response to changing phase and task variables, both during controlled treadmill trials (N=10, phase RMSE: 4.8 $\pm$ 2.4%) and a real-world stress test with extremely uneven terrain (N=1, phase RMSE: 4.8 $\pm$ 2.7%).

Submitted 6 October, 2022; v1 submitted 30 April, 2022; originally announced May 2022.

arXiv:2202.07181 [pdf]

Structural and phase evolution in U$_3$Si$_2$ during steam corrosion

Authors: Jiatu Liu, Patrick A. Burr, Joshua T. White, Vanessa K. Peterson, Pranesh Dayal, Christopher Baldwin, Deborah Wakeham, Daniel J. Gregg, Elizabeth S. Sooby, Edward G. Obbard

Abstract: U$_3$Si$_2$ nuclear fuel is corroded in deuterated steam with in situ neutron diffraction. Density functional theory is coupled with a rigorous thermodynamic description of the hydride, including gas/solid entropy contributions. H absorbs in the 2$b$ interstitial site of U$_3$Si$_2$H$_x$ and moves to 8$j$ for $x\ge 0.5$. Hydriding forces lattice expansion and a change in $a/c$ ratio linked to site preference. Rietveld refinement tracks the corrosion reactions at 350-500 °C and the preference for the 8$j$ site. Above 375 °C, formation of UO$_2$, U$_3$Si$_5$ and USi$_3$ takes place in the grain boundaries and bulk. Hydriding occurs in bulk and precedes other reactions.

Submitted 14 February, 2022; originally announced February 2022.

arXiv:2201.11409 [pdf, ps, other]

On the RTL Implementation of FINN Matrix Vector Compute Unit

Authors: Syed Asad Alam, David Gregg, Giulio Gambardella, Thomas Preusser, Michaela Blott

Abstract: FPGA-based accelerators are becoming more popular for deep neural networks due to the ability to scale performance with an increasing degree of specialization through dataflow architectures or custom data types. To reduce the barrier for software engineers and data scientists to adopt FPGAs, C++- and OpenCL-based design entries with high-level synthesis (HLS) have been introduced. They provide higher abstraction compared to register-transfer level (RTL)-based design. HLS offers faster development time, better maintainability and more flexibility in code exploration when evaluating options for multi-dimension tensors, convolutional layers or parallelism. Thus, HLS has been adopted by DNN accelerator generation frameworks such as FINN and hls4ml. In this paper, we present an alternative backend RTL library for FINN. We investigate and evaluate, across a spectrum of design dimensions, an RTL-based implementation versus the original HLS variant. We show that for smaller design parameters, RTL produces significantly smaller circuits. For larger circuits, however, the look-up table (LUT) count of the RTL-based design is slightly higher, up to around 15%. On the other hand, HLS consistently requires more flip-flops (FFs) (orders-of-magnitude increase) and block RAMs (BRAMs) ($2\times$ more). This also impacts the critical path delay, with RTL producing significantly faster circuits, up to 80%. Furthermore, RTL also benefits from at least a $10\times$ reduction in synthesis time. Finally, the results were practically validated using a real-world use case of a multi-layer perceptron (MLP) network used in network intrusion detection. Overall, since HLS frameworks code-generate the hardware design, the ease of design entry is less important than the reduction in synthesis time together with the resource benefits; this might make the RTL abstraction an attractive alternative.

Submitted 10 April, 2022; v1 submitted 27 January, 2022; originally announced January 2022.

Comments: 22 pages, 7 tables, 16 figures

ACM Class: B.5.0; B.2.5

arXiv:2201.10369 [pdf, ps, other]

Winograd Convolution for Deep Neural Networks: Efficient Point Selection

Authors: Syed Asad Alam, Andrew Anderson, Barbara Barabasz, David Gregg

Abstract: Convolutional neural networks (CNNs) have dramatically improved the accuracy of tasks such as object recognition, image segmentation and interactive speech systems. CNNs require large amounts of computing resources because of computationally intensive convolution layers. Fast convolution algorithms such as Winograd convolution can greatly reduce the computational cost of these layers at a cost of poor numeric properties, such that greater savings in computation exponentially increase floating point errors. A defining feature of each Winograd convolution algorithm is a set of real-value points where polynomials are sampled. The choice of points impacts the numeric accuracy of the algorithm, but the optimal set of points for small convolutions remains unknown. Existing work considers only small integers and simple fractions as candidate points. In this work, we propose a novel approach to point selection using points of the form {-1/c, -c, c, 1/c}, using the full range of real-valued numbers for c. We show that groups of this form cause cancellations in the Winograd transform matrices that reduce numeric error. We find empirically that the error for different values of c forms a rough curve across the range of real-value numbers, helping to localize the values of c that reduce error, and that lower errors can be achieved with non-obvious real-valued evaluation points instead of integers or simple fractions. We study a range of sizes for small convolutions and achieve reduction in error ranging from 2% to around 59% for both 1D and 2D convolution. Furthermore, we identify patterns in cases when we select a subset of our proposed points which will always lead to a lower error. Finally, we implement a complete Winograd convolution layer and use it to run deep convolution neural networks on real datasets and show that our proposed points reduce error, ranging from 22% to 63%.

Submitted 25 January, 2022; originally announced January 2022.

Comments: 19 pages, 3 figures, 9 tables and 32 equations

ACM Class: C.3.2; G.0
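
As background for the abstract above, the following sketch shows the textbook 1D Winograd algorithm F(2,3), which computes two outputs of a 3-tap filter with four multiplications instead of six. It uses the conventional evaluation points 0, 1 and -1 (plus the point at infinity), not the {-1/c, -c, c, 1/c} points the paper proposes; the function names and test values are hypothetical.

```python
def direct_f23(d, g):
    """Reference: two outputs of a 3-tap filter over 4 inputs (6 multiplies)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs via Winograd F(2,3), using 4 elementwise multiplies.

    The filter transform (the terms involving only g) can be precomputed
    once per filter, so only the four products m1..m4 remain per tile.
    """
    # Filter transform: evaluate the filter polynomial at 0, 1, -1, infinity.
    u1 = g[0]
    u2 = (g[0] + g[1] + g[2]) / 2
    u3 = (g[0] - g[1] + g[2]) / 2
    u4 = g[2]
    # Input transform followed by the four multiplications.
    m1 = (d[0] - d[2]) * u1
    m2 = (d[1] + d[2]) * u2
    m3 = (d[2] - d[1]) * u3
    m4 = (d[1] - d[3]) * u4
    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 1.0, -2.0]
print(direct_f23(d, g), winograd_f23(d, g))
```

The divisions and additions in the transforms are where rounding error enters; different evaluation points change those transform constants, which is the degree of freedom the paper explores.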

arXiv:2110.01562 [pdf, other]

Enhancing Voluntary Motion with Modular, Backdrivable, Powered Hip and Knee Orthoses

Authors: Christopher Nesler, Gray Thomas, Nikhil Divekar, Elliott J. Rouse, Robert D. Gregg

Abstract: Mobility disabilities are prominent in society with wide-ranging detriments to affected individuals. Addressing the specific deficits of individuals within this heterogeneous population requires modular, partial-assist, lower-limb exoskeletons. This paper introduces the Modular Backdrivable Lower-limb Unloading Exoskeleton (M-BLUE), which implements high torque, low mechanical impedance actuators on commercial orthoses with sheet metal modifications to produce a variety of hip- and/or knee-assisting configurations. Benchtop system identification verifies the desirable backdrive properties of the actuator, and allows for torque prediction within 0.4 Nm. An able-bodied human subject experiment demonstrates that three unilateral configurations of M-BLUE (hip only, knee only, and hip-knee) with a simple gravity compensation controller can reduce muscle EMG readings in a lifting and lowering task relative to the bare condition. Reductions in mean muscular effort and peak muscle activation were seen across the primary squat musculature (excluding biceps femoris), demonstrating the potential to reduce fatigue leading to poor lifting posture. These promising results motivate applications of M-BLUE to additional subject populations such as hip/knee osteoarthritis and geriatric frailty, and the expansion of M-BLUE to bilateral and ankle configurations.

Submitted 4 October, 2021; originally announced October 2021.

Comments: 8 pages, 7 figures

arXiv:2108.12307 [pdf, other]

doi 10.1038/s41597-021-01057-9

Lower-limb kinematics and kinetics during continuously varying human locomotion

Authors: Emma Reznick, Kyle R. Embry, Ross Neuman, Edgar Bolívar-Nieto, Nicholas P. Fey, Robert D. Gregg

Abstract: Human locomotion involves continuously variable activities including walking, running, and stair climbing over a range of speeds and inclinations as well as sit-stand, walk-run, and walk-stairs transitions. Understanding the kinematics and kinetics of the lower limbs during continuously varying locomotion is fundamental to developing robotic prostheses and exoskeletons that assist in community ambulation. However, available datasets on human locomotion neglect transitions between activities and/or continuous variations in speed and inclination during these activities. This data paper reports a new dataset that includes the lower-limb kinematics and kinetics of ten able-bodied participants walking at multiple inclines ($\pm$ 0, 5, 10 $^{\circ}$) and speeds (0.8, 1, 1.2 m/s), running at multiple speeds (1.8, 2, 2.2, 2.4 m/s), walking and running with constant acceleration ($\pm$ 0.2, 0.5 m/s$^2$), and stair ascent/descent with multiple stair inclines (20, 25, 30, 35 $^{\circ}$). This dataset also includes sit-stand transitions, walk-run transitions, and walk-stairs transitions. Data were recorded by a Vicon motion capture system and, for applicable tasks, a Bertec instrumented treadmill.

Submitted 27 August, 2021; originally announced August 2021.

arXiv:2102.06681 [pdf, ps, other]

doi 10.1145/3461699

Low precision logarithmic number systems: Beyond base-2

Authors: Syed Asad Alam, James Garland, David Gregg

Abstract: Logarithmic number systems (LNS) are used to represent real numbers in many applications using a constant base raised to a fixed-point exponent making its distribution exponential. This greatly simplifies hardware multiply, divide and square root. LNS with base-2 is most common, but in this paper we show that for low-precision LNS the choice of base has a significant impact. We make four main contributions. First, LNS is not closed under addition and subtraction, so the result is approximate. We show that choosing a suitable base can manipulate the distribution to reduce the average error. Second, we show that low-precision LNS addition and subtraction can be implemented efficiently in logic rather than commonly used ROM lookup tables, the complexity of which can be reduced by an appropriate choice of base. A similar effect is shown where the result of arithmetic has greater precision than the input. Third, where input data from external sources is not expected to be in LNS, we can reduce the conversion error by selecting a LNS base to match the expected distribution of the input. Thus, there is no one base which gives the global optimum, and base selection is a trade-off between different factors. Fourth, we show that circuits realized in LNS require lower area and power consumption for short word lengths.
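
To illustrate why an LNS simplifies multiply, divide and square root, here is a minimal Python sketch. It is not taken from the paper: the function names, the positive-only restriction, and the `frac_bits` fixed-point parameter are assumptions made for this example. A value is stored as an integer exponent of a chosen base, so multiplication becomes integer addition and square root becomes a halving of the exponent.

```python
import math

def lns_encode(x, base, frac_bits):
    """Store a positive real as a fixed-point exponent of `base` (hypothetical helper)."""
    if x <= 0:
        raise ValueError("this sketch handles positive values only")
    scale = 1 << frac_bits               # exponent is held in units of 1/scale
    return round(math.log(x, base) * scale)

def lns_decode(e, base, frac_bits):
    """Convert a fixed-point exponent back to a real number."""
    return base ** (e / (1 << frac_bits))

def lns_mul(e1, e2):
    return e1 + e2                       # log(a*b) = log(a) + log(b)

def lns_div(e1, e2):
    return e1 - e2                       # log(a/b) = log(a) - log(b)

def lns_sqrt(e):
    return e >> 1                        # log(sqrt(a)) = log(a)/2, lowest bit is dropped

# Base 2 with 3 fractional exponent bits: 4 * 8 = 32
a = lns_encode(4.0, 2.0, 3)
b = lns_encode(8.0, 2.0, 3)
product = lns_decode(lns_mul(a, b), 2.0, 3)
```

Addition and subtraction are deliberately left out: as the abstract notes, LNS is not closed under them, so they need an approximation whose cost and error depend on the base.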

Submitted 12 February, 2021; originally announced February 2021.

Comments: 22 pages, 12 figures, 8 tables, conference extension

MSC Class: 65G50 ACM Class: C.m; G.0

Journal ref: Syed Asad Alam, James Garland, and David Gregg. 2021. Low-precision Logarithmic Number Systems: Beyond Base-2. ACM Trans. Archit. Code Optim. 18, 4, Article 47 (December 2021), 25 pages

arXiv:2008.12975 [pdf, other]

doi 10.2140/involve.2022.15.669

Family sizes for complete multipartite graphs

Authors: Danielle Gregg, Thomas W. Mattman, Zachary Porat, George Todd

Abstract: The obstruction set for graphs with knotless embeddings is not known, but a recent paper of Goldberg, Mattman, and Naimi indicates that it is quite large. Almost all known obstructions fall into four Triangle-Y families and they ask if there is an efficient way of finding or estimating the size of such graph families. Inspired by this question, we investigate the family size for complete multipartite graphs. Aside from three families that appear to grow exponentially, these families stabilize: after a certain point, increasing the number of vertices in a fixed part does not change family size.

Submitted 30 March, 2021; v1 submitted 29 August, 2020; originally announced August 2020.

Comments: 15 pages, 6 figures, 8 tables. v2: substantial revision including improved estimate of family size of K_n family

MSC Class: Primary 05C10; Secondary 57M15; 05C35

Journal ref: Involve 15 (2022) 669-686

arXiv:2007.06563 [pdf, other]

HOBFLOPS CNNs: Hardware Optimized Bitslice-Parallel Floating-Point Operations for Convolutional Neural Networks

Authors: James Garland, David Gregg

Abstract: Convolutional neural networks (CNNs) are typically trained using 16- or 32-bit floating-point (FP) and researchers show that low-precision FP can be highly effective for inference. Low-precision FP can be implemented in field programmable gate array (FPGA) and application-specific integrated circuit (ASIC) accelerators, but existing processors do not generally support custom precision FP. We propose hardware optimized bitslice-parallel floating-point operators (HOBFLOPS), a method of generating efficient custom-precision emulated bitslice-parallel software FP arithmetic. We generate custom-precision FP routines optimized using a hardware synthesis design flow to create circuits. We provide standard cell libraries matching the bitwise operations on the target microprocessor architecture, and a code-generator to translate the hardware circuits to bitslice software equivalents. We exploit bitslice parallelism to create a very wide (32-512 element) vectorized CNN convolution. HOBFLOPS multiply-accumulate (MAC) performance in CNN convolution on Arm and Intel processors is compared to Berkeley's SoftFP16 equivalent MAC. HOBFLOPS16 outperforms SoftFP16 by 8x on Intel AVX512. HOBFLOPS offers arbitrary-precision FP with custom range and precision, e.g., HOBFLOPS9 performs at 6x the performance of HOBFLOPS16 on Arm Neon. HOBFLOPS allows researchers to prototype different levels of custom FP precision in the arithmetic of software CNN accelerators. Furthermore, HOBFLOPS fast custom-precision FP CNNs may be valuable in cases where memory bandwidth is limited.

Submitted 28 February, 2021; v1 submitted 10 July, 2020; originally announced July 2020.

Comments: 14 pages, 3 tables, 9 figures

arXiv:2006.11967 [pdf, other]

Exploiting Weight Redundancy in CNNs: Beyond Pruning and Quantization

Authors: Yuan Wen, David Gregg

Abstract: Pruning and quantization are proven methods for improving the performance and storage efficiency of convolutional neural networks (CNNs). Pruning removes near-zero weights in tensors and masks weak connections between neurons in neighbouring layers. Quantization reduces the precision of weights by replacing them with numerically similar values that require less storage. In this paper, we identify another form of redundancy in CNN weight tensors, in the form of repeated patterns of similar values. We observe that pruning and quantization both tend to drastically increase the number of repeated patterns in the weight tensors. We investigate several compression schemes to take advantage of this structure in CNN weight data, including multiple forms of Huffman coding, and other approaches inspired by block sparse matrix formats. We evaluate our approach on several well-known CNNs and find that we can achieve compaction ratios of 1.4x to 3.1x in addition to the saving from pruning and quantization.
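
As a rough illustration of how Huffman coding can exploit repeated values in pruned, quantized weights, the sketch below computes Huffman code lengths for a toy weight list. It is a generic textbook construction, not the paper's compression scheme; the function names and the toy data are assumptions for this example.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:                 # a single distinct symbol still needs one bit
        return {next(iter(counts)): 1}
    # Heap entries are (frequency, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)  # two least frequent subtrees
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

def compressed_bits(symbols):
    """Total payload size in bits when every symbol uses its Huffman code."""
    lengths = huffman_code_lengths(symbols)
    return sum(lengths[s] for s in symbols)

# Toy "pruned and quantized" weights: mostly zeros, a few +1 and -1 values.
weights = [0] * 12 + [1] * 2 + [-1] * 2
fixed_bits = 2 * len(weights)            # 2 bits per weight for three distinct values
huffman_bits = compressed_bits(weights)
```

The more skewed the value distribution becomes after pruning, the shorter the code for the dominant value, which is the effect the abstract relies on. The code table itself also has to be stored, which this sketch ignores.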

Submitted 21 June, 2020; originally announced June 2020.

arXiv:2005.10709 [pdf, other]

TASO: Time and Space Optimization for Memory-Constrained DNN Inference

Authors: Yuan Wen, Andrew Anderson, Valentin Radu, Michael F. P. O'Boyle, David Gregg

Abstract: Convolutional neural networks (CNNs) are used in many embedded applications, from industrial robotics and automation systems to biometric identification on mobile devices. State-of-the-art classification is typically achieved by large networks, which are prohibitively expensive to run on mobile and embedded devices with tightly constrained memory and energy budgets. We propose an approach for ahead-of-time domain specific optimization of CNN models, based on an integer linear programming (ILP) for selecting primitive operations to implement convolutional layers. We optimize the trade-off between execution time and memory consumption by: 1) attempting to minimize execution time across the whole network by selecting data layouts and primitive operations to implement each layer; and 2) allocating an appropriate workspace that reflects the upper bound of memory footprint per layer. These two optimization strategies can be used to run any CNN on any platform with a C compiler. Our evaluation with a range of popular ImageNet neural architectures (GoogleNet, AlexNet, VGG, ResNet and SqueezeNet) on the ARM Cortex-A15 yields speedups of 8x compared to a greedy algorithm based primitive selection, reduces memory requirement by 2.2x while sacrificing only 15% of inference time compared to a solver that considers inference time only. In addition, our optimization approach exposes a range of optimal points for different configurations across the Pareto frontier of memory and latency trade-off, which can be used under arbitrary system constraints.

Submitted 21 May, 2020; originally announced May 2020.

arXiv:2004.03376 [pdf, other]

doi 10.1109/SSCI47803.2020.9308157

Composition of Saliency Metrics for Channel Pruning with a Myopic Oracle

Authors: Kaveena Persand, Andrew Anderson, David Gregg

Abstract: The computation and memory needed for Convolutional Neural Network (CNN) inference can be reduced by pruning weights from the trained network. Pruning is guided by a pruning saliency, which heuristically approximates the change in the loss function associated with the removal of specific weights. Many pruning signals have been proposed, but the performance of each heuristic depends on the particular trained network. This leaves the data scientist with a difficult choice. When using any one saliency metric for the entire pruning process, we run the risk of the metric assumptions being invalidated, leading to poor decisions being made by the metric. Ideally we could combine the best aspects of different saliency metrics. However, despite an extensive literature review, we are unable to find any prior work on composing different saliency metrics. The chief difficulty lies in combining the numerical output of different saliency metrics, which are not directly comparable. We propose a method to compose several primitive pruning saliencies, to exploit the cases where each saliency measure does well. Our experiments show that the composition of saliencies avoids many poor pruning choices identified by individual saliencies. In most cases our method finds better selections than even the best individual pruning saliency.

Submitted 24 June, 2021; v1 submitted 3 April, 2020; originally announced April 2020.

Journal ref: 2020 IEEE Symposium Series on Computational Intelligence (SSCI)

arXiv:2001.02976 [pdf, other]

Performance-Oriented Neural Architecture Search

Authors: Andrew Anderson, Jing Su, Rozenn Dahyot, David Gregg

Abstract: Hardware-Software Co-Design is a highly successful strategy for improving performance of domain-specific computing systems. We argue for the application of the same methodology to deep learning; specifically, we propose to extend neural architecture search with information about the hardware to ensure that the model designs produced are highly efficient in addition to the typical criteria around accuracy. Using the task of keyword spotting in audio on edge computing devices, we demonstrate that our approach results in neural architecture that is not only highly accurate, but also efficiently mapped to the computing platform which will perform the inference. Using our modified neural architecture search, we demonstrate $0.88\%$ increase in TOP-1 accuracy with $1.85\times$ reduction in latency for keyword spotting in audio on an embedded SoC, and $1.59\times$ on a high-end GPU.

Submitted 9 January, 2020; originally announced January 2020.

Comments: The 2019 International Conference on High Performance Computing & Simulation

arXiv:1906.04675 [pdf, other]

doi 10.1109/ACCESS.2021.3108545

Taxonomy of Saliency Metrics for Channel Pruning

Authors: Kaveena Persand, Andrew Anderson, David Gregg

Abstract: Pruning unimportant parameters can allow deep neural networks (DNNs) to reduce their heavy computation and memory requirements. A saliency metric estimates which parameters can be safely pruned with little impact on the classification performance of the DNN. Many saliency metrics have been proposed, each within the context of a wider pruning algorithm. The result is that it is difficult to separate the effectiveness of the saliency metric from the wider pruning algorithm that surrounds it. Similar-looking saliency metrics can yield very different results because of apparently minor design choices. We propose a taxonomy of saliency metrics based on four mostly-orthogonal principal components. We show that a broad range of metrics from the pruning literature can be grouped according to these components. Our taxonomy not only serves as a guide to prior work, but allows us to construct new saliency metrics by exploring novel combinations of our taxonomic components. We perform an in-depth experimental investigation of more than 300 saliency metrics. Our results provide decisive answers to open research questions, and demonstrate the importance of reduction and scaling when pruning groups of weights. We find that some of our constructed metrics can outperform the best existing state-of-the-art metrics for convolutional neural network channel pruning.

Submitted 4 July, 2021; v1 submitted 11 June, 2019; originally announced June 2019.

Journal ref: IEEE Access, vol. 9, pp. 120110-120126, 2021

arXiv:1905.05233 [pdf, other]

Winograd Convolution for DNNs: Beyond linear polynomials

Authors: Barbara Barabasz, David Gregg

Abstract: Winograd convolution is widely used in deep neural networks (DNNs). Existing work for DNNs considers only the subset of Winograd algorithms that are equivalent to Toom-Cook convolution. We investigate a wider range of Winograd algorithms for DNNs and show that these additional algorithms can significantly improve floating point (FP) accuracy in many cases. We present results for three FP formats: fp32, fp16 and bf16 (a truncated form of fp32) using 2000 inputs from the ImageNet dataset. We found that in fp16 this approach gives us up to 6.5 times better image recognition accuracy in one important case while maintaining the same number of elementwise multiplication operations in the innermost loop. In bf16 the convolution can be computed using 5% fewer innermost loop multiplications than with currently used Winograd algorithms while keeping the accuracy of image recognition the same as for the direct convolution method.
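
For readers unfamiliar with the baseline that the abstract calls Toom-Cook-equivalent Winograd convolution, the sketch below shows the widely known one-dimensional F(2,3) case, which produces two outputs of a 3-tap filter with four multiplications instead of six. It is a standard textbook form, not one of the new algorithms studied in the paper; the function names are chosen for this example.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap correlation, computed directly (6 multiplications)."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return y0, y1

def winograd_f23(d, g):
    """The same two outputs using the standard F(2,3) form (4 multiplications)."""
    # Filter transform: in a DNN this is computed once per filter and reused.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Input transform uses only additions and subtractions.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform recombines the four products.
    return m1 + m2 + m3, m2 - m3 - m4

d = [1.0, 2.0, 3.0, 4.0]   # four consecutive input samples
g = [2.0, 4.0, 6.0]        # three filter taps
```

The divisions by 2 in the filter transform are one place where rounding error enters in low-precision formats, which is the kind of accuracy concern the abstract addresses.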

Submitted 25 June, 2019; v1 submitted 13 May, 2019; originally announced May 2019.

arXiv:1901.05049 [pdf, other]

doi 10.1145/3403572

Bonseyes AI Pipeline -- bringing AI to you. End-to-end integration of data, algorithms and deployment tools

Authors: Miguel de Prado, Jing Su, Rabia Saeed, Lorenzo Keller, Noelia Vallez, Andrew Anderson, David Gregg, Luca Benini, Tim Llewellynn, Nabil Ouerhani, Rozenn Dahyot, Nuria Pazos

Abstract: Next generation of embedded Information and Communication Technology (ICT) systems are collaborative systems able to perform autonomous tasks. The remarkable expansion of the embedded ICT market, together with the rise and breakthroughs of Artificial Intelligence (AI), have put the focus on the Edge as it stands as one of the keys for the next technological revolution: the seamless integration of AI in our daily life. However, training and deployment of custom AI solutions on embedded devices require a fine-grained integration of data, algorithms, and tools to achieve high accuracy. Such integration requires a high level of expertise that becomes a real bottleneck for small and medium enterprises wanting to deploy AI solutions on the Edge which, ultimately, slows down the adoption of AI on daily-life applications. In this work, we present a modular AI pipeline as an integrating framework to bring data, algorithms, and deployment tools together. By removing the integration barriers and lowering the required expertise, we can interconnect the different stages of tools and provide a modular end-to-end development of AI products for embedded devices. Our AI pipeline consists of four modular main steps: i) data ingestion, ii) model training, iii) deployment optimization and, iv) the IoT hub integration. To show the effectiveness of our pipeline, we provide examples of different AI applications during each of the steps. Besides, we integrate our deployment framework, LPDNN, into the AI pipeline and present its lightweight architecture and deployment capabilities for embedded devices. Finally, we demonstrate the results of the AI pipeline by showing the deployment of several AI applications such as keyword spotting, image classification and object detection on a set of well-known embedded platforms, where LPDNN consistently outperforms all other popular deployment frameworks.

Submitted 11 June, 2020; v1 submitted 15 January, 2019; originally announced January 2019.

arXiv:1812.04771 [pdf, other]

Robust Optimal Design of Energy Efficient Series Elastic Actuators: Application to a Powered Prosthetic Ankle

Authors: Edgar Bolívar, Siavash Rezazadeh, Tyler Summers, Robert D. Gregg

Abstract: Design of robotic systems that safely and efficiently operate in uncertain operational conditions, such as rehabilitation and physical assistance robots, remains an important challenge in the field. Current methods for the design of energy efficient series elastic actuators use an optimization formulation that typically assumes known operational conditions. This approach could lead to actuators that cannot perform in uncertain environments because elongation, speed, or torque requirements may be beyond actuator specifications when the operation deviates from its nominal conditions. Addressing this gap, we propose a convex optimization formulation to design the stiffness of series elastic actuators to minimize energy consumption and satisfy actuator constraints despite uncertainty due to manufacturing of the spring, unmodeled dynamics, efficiency of the transmission, and the kinematics and kinetics of the load. In our formulation, we express energy consumption as a scalar convex-quadratic function of compliance. In the unconstrained case, this quadratic equation provides an analytical solution to the optimal value of stiffness that minimizes energy consumption for arbitrary periodic reference trajectories. As actuator constraints, we consider peak motor torque, peak motor velocity, limitations due to the speed-torque relationship of DC motors, and peak elongation of the spring. As a simulation case study, we apply our formulation to the robust design of a series elastic actuator for a powered prosthetic ankle. Our simulation results indicate that a small trade-off between energy efficiency and robustness is justified to design actuators that can operate with uncertainty.

Submitted 5 February, 2019; v1 submitted 11 December, 2018; originally announced December 2018.

arXiv:1811.05414 [pdf, other]

A Phase Variable Approach for Improved Rhythmic and Non-Rhythmic Control of a Powered Knee-Ankle Prosthesis

Authors: Siavash Rezazadeh, David Quintero, Nikhil Divekar, Emma Reznick, Leslie Gray, Robert D. Gregg

Abstract: Although there has been recent progress in control of multi-joint prosthetic legs for rhythmic tasks such as walking, control of these systems for non-rhythmic motions and general real-world maneuvers is still an open problem. In this article, we develop a new controller that is capable of both rhythmic (constant-speed) walking, transitions between speeds and/or tasks, and some common volitional leg motions. We introduce a new piecewise holonomic phase variable, which, through a finite state machine, forms the basis of our controller. The phase variable is constructed by measuring the thigh angle, and the transitions in the finite state machine are formulated through sensing foot contact along with attributes of a nominal reference gait trajectory. The controller was implemented on a powered knee-ankle prosthesis and tested with a transfemoral amputee subject, who successfully performed a wide range of rhythmic and non-rhythmic tasks, including slow and fast walking, quick start and stop, backward walking, walking over obstacles, and kicking a soccer ball. Use of the powered leg resulted in clinically significant reductions in amputee compensations for rhythmic tasks (including vaulting and hip circumduction) when compared to use of the take-home passive leg. In addition, considerable improvements were also observed in the performance for non-rhythmic tasks. The proposed approach is expected to provide a better understanding of rhythmic and non-rhythmic motions in a unified framework, which in turn can lead to more reliable control of multi-joint prostheses for a wider range of real-world tasks.

Submitted 4 August, 2019; v1 submitted 13 November, 2018; originally announced November 2018.

arXiv:1809.10572 [pdf, other]

doi 10.1109/ARITH.2019.00018

Scalar Arithmetic Multiple Data: Customizable Precision for Deep Neural Networks

Authors: Andrew Anderson, David Gregg

Abstract: Quantization of weights and activations in Deep Neural Networks (DNNs) is a powerful technique for network compression, and has enjoyed significant attention and success. However, much of the inference-time benefit of quantization is accessible only through the use of customized hardware accelerators or by providing an FPGA implementation of quantized arithmetic. Building on prior work, we show how to construct arbitrary bit-precise signed and unsigned integer operations using a software technique which logically embeds a vector architecture with custom bit-width lanes in universally available fixed-width scalar arithmetic. We evaluate our approach on a high-end Intel Haswell processor, and an embedded ARM processor. Our approach yields very fast implementations of bit-precise custom DNN operations, which often match or exceed the performance of operations quantized to the sizes supported in native arithmetic. At the strongest level of quantization, our approach yields a maximum speedup of $\sim 6\times$ on the Intel platform, and $\sim 10\times$ on the ARM platform versus quantization to native 8-bit integers.
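
The idea of embedding custom-width vector lanes in ordinary scalar arithmetic can be sketched with a well-known "SIMD within a register" trick for lane-wise addition. This is a generic illustration under assumed names (`pack`, `unpack`, `packed_add`), not the construction used in the paper: several small unsigned lanes share one integer, and masking the top bit of each lane stops carries from leaking into the neighbouring lane.

```python
def pack(values, lane_bits):
    """Pack small unsigned integers into one scalar, lane 0 in the lowest bits."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * lane_bits)
    return word

def unpack(word, lane_bits, lanes):
    """Split a packed scalar back into its lanes."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]

def packed_add(a, b, lane_bits, lanes):
    """Lane-wise addition modulo 2**lane_bits using one scalar add."""
    high = 0
    for i in range(lanes):
        high |= 1 << (i * lane_bits + lane_bits - 1)   # top bit of every lane
    full = (1 << (lane_bits * lanes)) - 1
    low = full & ~high
    # Add with the top bits cleared, so a carry can reach a lane's top bit
    # but can never cross into the next lane.
    partial = (a & low) + (b & low)
    # Restore the top-bit contribution of the inputs with XOR (carry-less add).
    return partial ^ ((a ^ b) & high)

# Two 3-bit lanes: [5, 7] + [4, 3] = [9 mod 8, 10 mod 8] = [1, 2]
result = unpack(packed_add(pack([5, 7], 3), pack([4, 3], 3), 3, 2), 3, 2)
```

Python integers are unbounded, so the sketch applies an explicit `full` mask; on real hardware the lanes would be sized to fit a fixed-width register.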

Submitted 12 December, 2019; v1 submitted 27 September, 2018; originally announced September 2018.

arXiv:1803.10986 [pdf, other]

Error Analysis and Improving the Accuracy of Winograd Convolution for Deep Neural Networks

Authors: Barbara Barabasz, Andrew Anderson, Kirk M. Soodhalter, David Gregg

Abstract: Popular deep neural networks (DNNs) spend the majority of their execution time computing convolutions. The Winograd family of algorithms can greatly reduce the number of arithmetic operations required and is present in many DNN software frameworks. However, the performance gain is at the expense of a reduction in floating point (FP) numerical accuracy. In this paper, we analyse the worst case FP error and prove the estimation of norm and conditioning of the algorithm. We show that the bound grows exponentially with the size of the convolution, but the error bound of the modified algorithm is smaller than the original one. We propose several methods for reducing FP error. We propose a canonical evaluation ordering based on Huffman coding that reduces summation error. We study the selection of sampling "points" experimentally and find empirically good points for the most important sizes. We identify the main factors associated with good points. In addition, we explore other methods to reduce FP error, including mixed-precision convolution, and pairwise summation across DNN channels. Using our methods we can significantly reduce FP error for a given block size, which allows larger block sizes and reduced computation.
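
Pairwise summation, one of the error-reduction methods named in the abstract, can be illustrated in a few lines. The sketch below is a generic recursive version compared against left-to-right accumulation; it does not reproduce how the paper applies the idea across DNN channels, and the test data is an assumption for this example.

```python
import math

def naive_sum(xs):
    """Left-to-right accumulation: rounding error can grow with the length of xs."""
    total = 0.0
    for x in xs:
        total += x
    return total

def pairwise_sum(xs):
    """Recursively sum the two halves, so each value passes through about log2(n) additions."""
    n = len(xs)
    if n == 0:
        return 0.0
    if n == 1:
        return xs[0]
    mid = n // 2
    return pairwise_sum(xs[:mid]) + pairwise_sum(xs[mid:])

data = [0.1] * 1024
reference = math.fsum(data)              # correctly rounded sum from the standard library
naive_error = abs(naive_sum(data) - reference)
pairwise_error = abs(pairwise_sum(data) - reference)
```

With 1024 equal inputs every pairwise addition is an exact doubling, so the pairwise result matches the reference, while the naive loop rounds at most of its steps.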

Submitted 1 May, 2019; v1 submitted 29 March, 2018; originally announced March 2018.

arXiv:1801.10219 [pdf, other]

Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing

Authors: James Garland, David Gregg

Abstract: Convolutional neural networks (CNNs) are one of the most successful machine learning techniques for image, voice and video processing. CNNs require large amounts of processing capacity and memory bandwidth. Hardware accelerators have been proposed for CNNs which typically contain large numbers of multiply-accumulate (MAC) units, the multipliers of which are large in an integrated circuit (IC) gate count and power consumption. "Weight sharing" accelerators have been proposed where the full range of weight values in a trained CNN are compressed and put into bins and the bin index used to access the weight-shared value. We reduce power and area of the CNN by implementing parallel accumulate shared MAC (PASM) in a weight shared CNN. PASM re-architects the MAC to instead count the frequency of each weight and place it in a bin. The accumulated value is computed in a subsequent multiply phase, significantly reducing gate count and power consumption of the CNN. In this paper, we implement PASM in a weight-shared CNN convolution hardware accelerator and analyze its effectiveness. Experiments show that, at a clock speed of 1 GHz on a 45 nm ASIC process, our approach results in fewer gates, smaller logic, and reduced power with only a slight increase in latency. We also show that the same weight-shared-with-PASM CNN accelerator can be implemented in resource-constrained FPGAs, where the FPGA has limited numbers of digital signal processor (DSP) units to accelerate the MAC operations.
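
The two-phase idea described above can be sketched in software, assuming one reading of the abstract: inputs that share a weight bin are first accumulated together, and each bin is multiplied by its shared weight only once at the end. The function names and toy data are hypothetical, and the sketch says nothing about the hardware design itself.

```python
def weight_shared_mac(activations, bin_indices, weight_bins):
    """Conventional MAC with shared weights: one multiplication per input."""
    return sum(a * weight_bins[i] for a, i in zip(activations, bin_indices))

def accumulate_then_multiply(activations, bin_indices, weight_bins):
    """Accumulate inputs per weight bin first, then multiply once per bin."""
    bin_sums = [0] * len(weight_bins)
    for a, i in zip(activations, bin_indices):
        bin_sums[i] += a                  # accumulate phase: additions only
    # Multiply phase: the number of multiplications equals the number of bins.
    return sum(s * w for s, w in zip(bin_sums, weight_bins))

activations = [1, 2, 3, 4]
bin_indices = [0, 1, 0, 1]                # which shared weight each input uses
weight_bins = [10, -2]                    # the shared weight values
```

Both functions return the same dot product; the second needs only as many multiplications as there are bins, which is where a saving would come from when the number of bins is much smaller than the number of inputs.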

Submitted 1 May, 2018; v1 submitted 30 January, 2018; originally announced January 2018.

arXiv:1710.01079 [pdf, other]

Optimal DNN Primitive Selection with Partitioned Boolean Quadratic Programming

Authors: Andrew Anderson, David Gregg

Abstract: Deep Neural Networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. Many different algorithms have been proposed to implement the most computationally expensive layers of DNNs. Further, each of these algorithms has a large number of variants, which offer different trade-offs of parallelism, data locality, memory footprint, and execution time. In addition, specific algorithms operate much more efficiently on specialized data layouts and formats. We state the problem of optimal primitive selection in the presence of data format transformations, and show that it is NP-hard by demonstrating an embedding in the Partitioned Boolean Quadratic Assignment problem (PBQP). We propose an analytic solution via a PBQP solver, and evaluate our approach experimentally by optimizing several popular DNNs using a library of more than 70 DNN primitives, on an embedded platform and a general purpose platform. We show experimentally that significant gains are possible versus the state of the art vendor libraries by using a principled analytic solution to the problem of layout selection in the presence of data format transformations.

Submitted 2 November, 2018; v1 submitted 3 October, 2017; originally announced October 2017.

arXiv:1709.03395 [pdf, other]

Low-memory GEMM-based convolution algorithms for deep neural networks

Authors: Andrew Anderson, Aravind Vasudevan, Cormac Keane, David Gregg

Abstract: Deep neural networks (DNNs) require very large amounts of computation both for training and for inference when deployed in the field. A common approach to implementing DNNs is to recast the most computationally expensive operations as general matrix multiplication (GEMM). However, as we demonstrate in this paper, there are a great many different ways to express DNN convolution operations using GEMM. Although different approaches all perform the same number of operations, the size of temporary data structures differs significantly. Convolution of an input matrix with dimensions $C \times H \times W$, requires $O(K^2CHW)$ additional space using the classical im2col approach. More recently memory-efficient approaches requiring just $O(KCHW)$ auxiliary space have been proposed. We present two novel GEMM-based algorithms that require just $O(MHW)$ and $O(KW)$ additional space respectively, where $M$ is the number of channels in the result of the convolution. These algorithms dramatically reduce the space overhead of DNN convolution, making it much more suitable for memory-limited embedded systems. Experimental evaluation shows that our low-memory algorithms are just as fast as the best patch-building approaches despite requiring just a fraction of the amount of additional memory. Our low-memory algorithms have excellent data locality which gives them a further edge over patch-building algorithms when multiple cores are used. As a result, our low memory algorithms often outperform the best patch-building algorithms using multiple threads.

Submitted 8 September, 2017; originally announced September 2017.

Comments: 13 pages, 16 figures and 3 tables. arXiv admin note: text overlap with arXiv:1704.04428
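
The im2col space cost quoted in the abstract above can be illustrated with a minimal sketch. This is not the authors' code: the function names (`im2col`, `conv_gemm`, `conv_direct`) and the nested-list data layout are assumptions made for illustration, and the sketch uses stride 1 with no padding, so the patch matrix holds $K^2 C (H-K+1)(W-K+1)$ values rather than exactly $K^2CHW$. As is usual in DNN libraries, "convolution" here means cross-correlation.

```python
def im2col(image, k):
    """Expand a C x H x W image (nested lists) into a patch matrix.

    Each column holds one k x k x C patch, so the matrix has C*k*k rows
    and (H-k+1)*(W-k+1) columns (stride 1, no padding).
    """
    c, h, w = len(image), len(image[0]), len(image[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    rows = []
    for ch in range(c):
        for dy in range(k):
            for dx in range(k):
                rows.append([image[ch][y + dy][x + dx]
                             for y in range(out_h) for x in range(out_w)])
    return rows


def conv_gemm(image, kernels, k):
    """Convolve via one matrix product: (M x C*k*k) times the patch matrix."""
    patches = im2col(image, k)
    # Flatten each kernel in the same (channel, dy, dx) order as the patch rows.
    flat = [[v for ch in kern for row in ch for v in row] for kern in kernels]
    return [[sum(a * b for a, b in zip(krow, col)) for col in zip(*patches)]
            for krow in flat]


def conv_direct(image, kernels, k):
    """Reference implementation that needs no temporary patch matrix."""
    c, h, w = len(image), len(image[0]), len(image[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    return [[sum(kern[ch][dy][dx] * image[ch][y + dy][x + dx]
                 for ch in range(c) for dy in range(k) for dx in range(k))
             for y in range(out_h) for x in range(out_w)]
            for kern in kernels]


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
         [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]       # C=2, H=W=3
kernels = [[[[1, 0], [0, 1]], [[1, 1], [1, 1]]]]  # M=1, K=2
```

Both routes give the same result on this input; the difference is the temporary `patches` matrix, whose size grows with $K^2$ and is the overhead the low-memory algorithms aim to avoid.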

arXiv:1704.08449 [pdf, other]

doi 10.1051/0004-6361/201730582

The full spectral radiative properties of Proxima Centauri

Authors: Ignasi Ribas, Michael D. Gregg, Tabetha S. Boyajian, Emeline Bolmont

Abstract: The discovery of Proxima b, a terrestrial temperate planet, presents the opportunity of studying a potentially habitable world in optimal conditions. A key aspect to model its habitability is to understand the radiation environment of the planet in the full spectral domain. We characterize the X-rays to mid-IR radiative properties of Proxima with the goal of providing the top-of-atmosphere fluxes on the planet. We also aim at constraining the fundamental properties of the star. We employ observations from a large number of facilities and make use of different methodologies to piece together the full spectral energy distribution of Proxima. In the high-energy domain, we pay particular attention to the contribution by rotational modulation, activity cycle, and flares so that the data provided are representative of the overall radiation dose received by the atmosphere of the planet. We present the full spectrum of Proxima covering 0.7 to 30000 nm. The integration of the data shows that the top-of-atmosphere average XUV irradiance on Proxima b is 0.293 W m^-2, i.e., nearly 60 times higher than Earth, and that the total irradiance is 877+/-44 W m^-2, or 64+/-3% of the solar constant but with a significantly redder spectrum. We also provide laws for the XUV evolution of Proxima corresponding to two scenarios. Regarding the fundamental properties of Proxima, we find M=0.120+/-0.003 Msun, R=0.146+/-0.007 Rsun, Teff=2980+/-80 K, and L=0.00151+/-0.00008 Lsun. In addition, our analysis reveals a ~20% excess in the 3-30 micron flux of the star that is best interpreted as arising from warm dust in the system. The data provided here should be useful to further investigate the current atmospheric properties of Proxima b as well as its past history, with the overall aim of firmly establishing the habitability of the planet.

Submitted 27 April, 2017; originally announced April 2017.

Comments: 12 pages, 5 figures, accepted for publication in Astronomy & Astrophysics

Journal ref: A&A 603, A58 (2017)

arXiv:1704.04428 [pdf, other]

Parallel Multi Channel Convolution using General Matrix Multiplication

Authors: Aravind Vasudevan, Andrew Anderson, David Gregg

Abstract: Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally intensive parts of CNNs are the convolutional layers, which convolve multi-channel images with multiple kernels. A common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. This im2col conversion greatly increases the memory footprint of the input matrix and reduces data locality. In this paper we propose a new approach to MCMK convolution that is based on General Matrix Multiplication (GEMM), but not on im2col. Our algorithm eliminates the need for data replication on the input thereby enabling us to apply the convolution kernels on the input images directly. We have implemented several variants of our algorithm on a CPU processor and an embedded ARM processor. On the CPU, our algorithm is faster than im2col in most cases.

Submitted 3 July, 2017; v1 submitted 6 April, 2017; originally announced April 2017.

Comments: Camera ready version to be published at ASAP 2017 - The 28th Annual IEEE International Conference on Application-specific Systems, Architectures and Processors. 6 pages

arXiv:1701.08800 [pdf, other]

Mutual Inclusivity of the Critical Path and its Partial Schedule on Heterogeneous Systems

Authors: Aravind Vasudevan, David Gregg

Abstract: The critical path of a group of tasks is an important measure that is commonly used to guide task allocation and scheduling on parallel computers. The critical path is the longest chain of dependencies in an acyclic task dependence graph. A problem arises on heterogeneous parallel machines where computation and communication costs can vary between different types of processor. Existing solutions for heterogeneous machines attempt to estimate the critical path using average values of computation and communication costs. However, this ignores opportunities to match specific tasks to specific classes of processor and communication links, and can result in quite misleading paths being identified as critical. We argue that an accurate critical path must consider the mapping of tasks to classes of processor and communication links. We formulate a polynomial time algorithm to find such a critical path. Our Critical Earliest Finish Time (CEFT) algorithm finds both the length of the critical path and an allocation of tasks to processors on that path. We compared CEFT experimentally to existing approaches such as averaging execution times across processors. The latter approach fails to accurately model the execution cost of tasks, and as a result fails to identify a correct critical path in 83.99% of cases in our experiments. We also adapted a critical path-oriented scheduling algorithm (CPOP) to use our critical path algorithm and found that the resulting schedules are faster.

Submitted 30 January, 2017; originally announced January 2017.
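
The definition of the critical path given in the abstract above (the longest chain of dependencies in an acyclic task graph) can be sketched as follows. This is the textbook single-cost version, not the CEFT algorithm: it assumes one fixed execution cost per task and no communication costs, and the function name `critical_path` and its argument layout are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(cost, deps):
    """Return (length, path) of the longest dependency chain.

    cost: task -> execution time (one cost per task: an assumption that
          ignores the heterogeneous-processor case the paper addresses)
    deps: task -> set of tasks that must finish first
    """
    finish = {}     # earliest finish time of each task
    best_pred = {}  # predecessor on the longest chain ending at each task
    for task in TopologicalSorter(deps).static_order():
        preds = deps.get(task, ())
        prev = max(preds, key=lambda p: finish[p], default=None)
        best_pred[task] = prev
        finish[task] = cost[task] + (finish[prev] if prev is not None else 0)

    # Walk back from the task that finishes last.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return length, path[::-1]


cost = {"a": 2, "b": 3, "c": 1, "d": 4}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
length, path = critical_path(cost, deps)
```

On a heterogeneous machine each task would have a different cost per processor class, which is why averaging those costs into a single `cost` table, as this sketch effectively requires, can identify a misleading path.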

arXiv:1611.05378 [pdf, ps, other]

Spectral Convolution Networks

Authors: Maria Francesca, Arthur Hughes, David Gregg

Abstract: Previous research has shown that computation of convolution in the frequency domain provides a significant speedup versus traditional convolution network implementations. However, this performance increase comes at the expense of repeatedly computing the transform and its inverse in order to apply other network operations such as activation, pooling, and dropout. We show, mathematically, how convolution and activation can both be implemented in the frequency domain using either the Fourier or Laplace transformation. The main contributions are a description of spectral activation under the Fourier transform and a further description of an efficient algorithm for computing both convolution and activation under the Laplace transform. By computing both the convolution and activation functions in the frequency domain, we can reduce the number of transforms required, as well as reducing overall complexity. Our description of a spectral activation function, together with existing spectral analogs of other network functions may then be used to compose a fully spectral implementation of a convolution network.

Submitted 16 November, 2016; originally announced November 2016.

arXiv:1609.05132 [pdf, other]

doi 10.1109/LCA.2017.2656880

Low Complexity Multiply Accumulate Unit for Weight-Sharing Convolutional Neural Networks

Authors: James Garland, David Gregg

Abstract: Convolutional Neural Networks (CNNs) are one of the most successful deep machine learning technologies for processing image, voice and video data. CNNs require large amounts of processing capacity and memory, which can exceed the resources of low power mobile and embedded systems. Several designs for hardware accelerators have been proposed for CNNs which typically contain large numbers of Multiply Accumulate (MAC) units. One approach to reducing data sizes and memory traffic in CNN accelerators is "weight sharing", where the full range of values in a trained CNN are put in bins and the bin index is stored instead of the original weight value. In this paper we propose a novel MAC circuit that exploits binning in weight-sharing CNNs. Rather than computing the MAC directly we instead count the frequency of each weight and place it in a bin. We then compute the accumulated value in a subsequent multiply phase. This allows hardware multipliers in the MAC circuit to be replaced with adders and selection logic. Experiments show that for the same clock speed our approach results in fewer gates, smaller logic, and reduced power.

Submitted 19 January, 2017; v1 submitted 30 August, 2016; originally announced September 2016.

Comments: 4 pages
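
The two-phase scheme described in the abstract above can be sketched in software. This is one reading of the abstract, not the authors' circuit: it assumes each weight is stored as an index into a small table of shared values (here called `codebook`, a hypothetical name), that inputs are accumulated per bin in the first phase, and that one multiply per bin follows in the second phase.

```python
def mac_direct(inputs, bin_indices, codebook):
    """Conventional MAC: one multiply per input."""
    return sum(x * codebook[b] for x, b in zip(inputs, bin_indices))


def mac_binned(inputs, bin_indices, codebook):
    """Weight-shared MAC: accumulate per bin, then multiply once per bin."""
    bins = [0] * len(codebook)
    for x, b in zip(inputs, bin_indices):
        bins[b] += x  # accumulate phase: only an adder and a bin select
    # Multiply phase: one multiply per shared weight, not one per input.
    return sum(acc * w for acc, w in zip(bins, codebook))


inputs = [1, 2, 3, 4]
bin_indices = [0, 1, 0, 2]   # which shared weight each input uses
codebook = [5, -1, 2]        # the shared weight values
```

With 4 inputs and 3 bins the saving is small, but the number of multiplies depends only on the number of bins, so it stays fixed as the number of inputs grows.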

arXiv:1602.04716 [pdf, other]

Customizable Precision of Floating-Point Arithmetic with Bitslice Vector Types

Authors: Shixiong Xu, David Gregg

Abstract: Customizing the precision of data can provide attractive trade-offs between accuracy and hardware resources. We propose a novel form of vector computing aimed at arrays of custom-precision floating point data. We represent these vectors in bitslice format. Bitwise instructions are used to implement arithmetic circuits in software that operate on customized bit-precision. Experiments show that this approach can be efficient for vectors of low-precision custom floating point types, while providing arbitrary bit precision.

Submitted 15 February, 2016; originally announced February 2016.
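
The idea of implementing an arithmetic circuit in software over bitslice data, as mentioned in the abstract above, can be sketched with a ripple-carry adder. This is a simplified illustration only: it adds unsigned integers rather than the custom-precision floating point types the paper targets, and the helper names (`to_bitslices`, `from_bitslices`, `bitslice_add`) are hypothetical.

```python
def to_bitslices(values, width):
    """Slice i is an int whose bit j equals bit i of values[j]."""
    return [sum(((v >> i) & 1) << j for j, v in enumerate(values))
            for i in range(width)]


def from_bitslices(slices, count):
    """Inverse of to_bitslices for `count` elements."""
    return [sum(((s >> j) & 1) << i for i, s in enumerate(slices))
            for j in range(count)]


def bitslice_add(a, b):
    """Ripple-carry adder built from XOR/AND/OR on whole slices.

    Every bitwise operation processes one bit position of all elements
    at once; the result wraps modulo 2**width.
    """
    carry = 0
    out = []
    for x, y in zip(a, b):
        out.append(x ^ y ^ carry)
        carry = (x & y) | (carry & (x ^ y))
    return out


width = 5
a = to_bitslices([3, 7, 12, 31], width)
b = to_bitslices([1, 9, 5, 1], width)
total = from_bitslices(bitslice_add(a, b), 4)
```

Because the width is just the number of slices, the same adder works for any bit precision, which is the property that makes the format attractive for custom-precision types.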

arXiv:1601.07789 [pdf, other]

doi 10.1145/2967938.2967966

Vectorization of Multibyte Floating Point Data Formats

Authors: Andrew Anderson, David Gregg

Abstract: We propose a scheme for reduced-precision representation of floating point data on a continuum between IEEE-754 floating point types. Our scheme enables the use of lower precision formats for a reduction in storage space requirements and data transfer volume. We describe how our scheme can be accelerated using existing hardware vector units on a general-purpose processor (GPP). Exploiting native vector hardware allows us to support reduced precision floating point with low overhead. We demonstrate that supporting reduced precision in the compiler as opposed to using a library approach can yield a low overhead solution for GPPs.

Submitted 22 July, 2016; v1 submitted 26 January, 2016; originally announced January 2016.

ACM Class: D.3.4; G.1.0; B.2.4; I.4.2

arXiv:1508.01753 [pdf, other]

Practical Algorithms for Finding Extremal Sets

Authors: Martin Marinov, Nicholas Nash, David Gregg

Abstract: The minimal sets within a collection of sets are defined as the ones which do not have a proper subset within the collection, and the maximal sets are the ones which do not have a proper superset within the collection. Identifying extremal sets is a fundamental problem with a wide-range of applications in SAT solvers, data-mining and social network analysis. In this paper, we present two novel improvements of the high-quality extremal set identification algorithm, \textit{AMS-Lex}, described by Bayardo and Panda. The first technique uses memoization to improve the execution time of the single-threaded variant of the AMS-Lex, whilst our second improvement uses parallel programming methods. In a subset of the presented experiments our memoized algorithm executes more than $400$ times faster than the highly efficient publicly available implementation of AMS-Lex. Moreover, we show that our modified algorithm's speedup is not bounded above by a constant and that it increases as the length of the common prefixes in successive input \textit{itemsets} increases. We provide experimental results using both real-world and synthetic data sets, and show our multi-threaded variant algorithm out-performing AMS-Lex by $3$ to $6$ times. We find that on synthetic input datasets when executed using $16$ CPU cores of a $32$-core machine, our multi-threaded program executes about as fast as the state of the art parallel GPU-based program using an NVIDIA GTX 580 graphics processing unit.

Submitted 7 August, 2015; originally announced August 2015.
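
The definitions of minimal and maximal sets in the abstract above translate directly into a brute-force check. The sketch below is a quadratic-time baseline written only to make the definitions concrete; it is not AMS-Lex or the memoized variant the paper describes, and the function name `extremal_sets` is hypothetical.

```python
def extremal_sets(collection):
    """Return (minimal, maximal) sets of a collection by pairwise comparison.

    A set is minimal if no other set in the collection is a proper subset
    of it, and maximal if none is a proper superset.
    """
    # Drop duplicates while keeping first-seen order.
    sets = list(dict.fromkeys(frozenset(s) for s in collection))
    minimal = [s for s in sets if not any(t < s for t in sets)]
    maximal = [s for s in sets if not any(t > s for t in sets)]
    return minimal, maximal


collection = [{1}, {1, 2}, {2, 3}, {1, 2, 3}, {4}]
minimal, maximal = extremal_sets(collection)
```

Here `{4}` is both minimal and maximal because no other set in the collection contains it or is contained in it.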

arXiv:1507.05841 [pdf, other]

On the GI-Completeness of a Sorting Networks Isomorphism

Authors: Martin Marinov, David Gregg

Abstract: The subitemset isomorphism problem is really important and there are excellent practical solutions described in the literature. However, the computational complexity analysis and classification of the BZ (Bundala and Zavodny) subitemset isomorphism problem is currently an open problem. In this paper we prove that checking whether two sorting networks are BZ isomorphic to each other is GI-Complete; the general GI (Graph Isomorphism) problem is known to be in NP and LWPP, but widely believed to be neither P nor NP-Complete; recent research suggests that the problem is in QP. Moreover, we state the BZ sorting network isomorphism problem as a general isomorphism problem on itemsets --- because every sorting network is represented by Bundala and Zavodny as an itemset. The complexity classification presented in this paper applies sorting networks, as well as the general itemset isomorphism problem. The main consequence of our work is that currently no polynomial-time algorithm exists for solving the BZ sorting network subitemset isomorphism problem; however the CM (Choi and Moon) sorting network isomorphism problem can be efficiently solved in polynomial time.

Submitted 19 January, 2016; v1 submitted 21 July, 2015; originally announced July 2015.

ACM Class: F.1.3; F.2.2

arXiv:1502.05983 [pdf, ps, other]

Sorting Networks: The Final Countdown

Authors: Martin Marinov, David Gregg

Abstract: In this paper we extend the knowledge on the problem of empirically searching for sorting networks of minimal depth. We present new search space pruning techniques for the last four levels of a candidate sorting network by considering only the output set representation of a network. We present an algorithm for checking whether an $n$-input sorting network of depth $d$ exists by considering the minimal up to permutation and reflection itemsets at each level and using the pruning at the last four levels. We experimentally evaluated this algorithm to find the optimal depth sorting networks for all $n \leq 12$.

Submitted 20 February, 2015; originally announced February 2015.

Comments: arXiv admin note: substantial text overlap with arXiv:1502.04748

ACM Class: F.2.2
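
For readers unfamiliar with the objects being searched for in the entry above, the sketch below checks whether a layered comparator network is a sorting network. It relies on the standard zero-one principle (a comparator network sorts every input if and only if it sorts every 0/1 input) and is not the pruning algorithm from the paper; the layer representation and function names are assumptions made for illustration.

```python
from itertools import product


def apply_network(network, values):
    """Run values through a network given as a list of layers.

    Each layer is a list of comparators (i, j) with i < j; a comparator
    swaps the two wires if they are out of order.
    """
    v = list(values)
    for layer in network:
        for i, j in layer:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v


def is_sorting_network(network, n):
    """Zero-one principle: test all 2**n binary inputs."""
    return all(apply_network(network, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))


# A well-known 4-input network of depth 3 (depth = number of layers).
net4 = [[(0, 1), (2, 3)], [(0, 2), (1, 3)], [(1, 2)]]
```

The exhaustive check costs $2^n$ network evaluations, which is manageable for the small $n$ discussed in the abstract but says nothing about how candidate networks are generated or pruned.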

arXiv:1502.04748 [pdf, ps, other]

The Takeoff Towards Optimal Sorting Networks

Authors: Martin Marinov, David Gregg

Abstract: A complete set of filters $F_n$ for the optimal-depth $n$-input sorting network problem is such that if there exists an $n$-input sorting network of depth $d$ then there exists one of the form $C \oplus C'$ for some $C \in F_n$. Previous work on the topic presents a method for finding complete set of filters $R_{n, 1}$ and $R_{n, 2}$ that consists only of networks of depths one and two respectively, whose outputs are minimal and representative up to permutation and reflection. Our main contribution is a practical approach for finding a complete set of filters $R_{n, 3}$ containing only networks of depth three whose outputs are minimal and representative up to permutation and reflection. In previous work, we have developed a highly efficient algorithm for finding extremal sets ( i.e. outputs of comparator networks; itemsets; ) up to permutation. In this paper we present a modification to this algorithm that identifies the representative itemsets up to permutation and reflection. Hence, the presented practical approach is the successful combination of known theory and practice that we apply to the domain of sorting networks. For $n < 17$, we empirically compute the complete set of filters $R_{n, 2}$, $R_{n, 3}$, $R_{n, 2} \upharpoonright w $ and $R_{n, 3}^w$ of the representative minimal up to permutation and reflection $n$-input networks, where all but $R_{n, 2}$ are novel to this work.

Submitted 12 March, 2015; v1 submitted 16 February, 2015; originally announced February 2015.

ACM Class: F.2.2

arXiv:1310.3739 [pdf, ps, other]

doi 10.1093/mnras/stt1808

The X-ray Spectrum and Spectral Energy Distribution of FIRST J155633.8+351758: a LoBAL Quasar with a Probable Polar Outflow

Authors: Robert C. Berrington, Michael S. Brotherton, Sarah C. Gallagher, Rajib Ganguly, Zhaohui Shang, Michael DiPompeo, Ritaban Chatterjee, Mark Lacy, Michael D. Gregg, Patrick B. Hall, S. A. Laurent-Muehleisen

Abstract: We report the results of a new 60 ks Chandra X-ray Observatory Advanced CCD Imaging Spectrometer S-array (ACIS-S) observation of the reddened, radio-selected, highly polarized `FeLoBAL' quasar FIRST J1556+3517. We investigated a number of models of varied sophistication to fit the 531-photon spectrum. These models ranged from simple power laws to power laws absorbed by hydrogen gas in differing ionization states and degrees of partial covering. Preferred fits indicate that the intrinsic X-ray flux is consistent with that expected for quasars of similarly high luminosity, i.e., an intrinsic, dereddened and unabsorbed optical to X-ray spectral index of -1.7. We cannot tightly constrain the intrinsic X-ray power-law slope, but find indications that it is flat (photon index Gamma = 1.7 or flatter at a >99% confidence for a neutral hydrogen absorber model). Absorption is present, with a column density a few times 10^23 cm^-2, with both partially ionized models and partially covering neutral hydrogen models providing good fits. We present several lines of argument that suggest the fraction of X-ray emissions associated with the radio jet is not large. We combine our Chandra data with observations from the literature to construct the spectral energy distribution of FIRST J1556+3517 from radio to X-ray energies. We make corrections for Doppler beaming for the pole-on radio jet, optical dust reddening, and X-ray absorption, in order to recover a probable intrinsic spectrum. The quasar FIRST J1556+3517 seems to be an intrinsically normal radio-quiet quasar with a reddened optical/UV spectrum, a Doppler-boosted but intrinsically weak radio jet, and an X-ray absorber not dissimilar from that of other broad absorption line quasars.

Submitted 14 October, 2013; originally announced October 2013.

Comments: to be published in MNRAS

arXiv:1111.0061 [pdf, ps, other]

doi 10.1088/2041-8205/743/1/L4

The Lick AGN Monitoring Project 2011: Reverberation Mapping of Markarian 50

Authors: A. J. Barth, A. Pancoast, S. J. Thorman, V. N. Bennert, D. J. Sand, W. Li, G. Canalizo, A. V. Filippenko, E. L. Gates, J. E. Greene, M. A. Malkan, D. Stern, T. Treu, J. -H. Woo, R. J. Assef, H. -J. Bae, B. J. Brewer, T. Buehler, S. B. Cenko, K. I. Clubb, M. C. Cooper, A. M. Diamond-Stanic, K. D. Hiner, S. F. Hoenig, M. D. Joner , et al. (24 additional authors not shown)

Abstract: The Lick AGN Monitoring Project 2011 observing campaign was carried out over the course of 11 weeks in Spring 2011. Here we present the first results from this program, a measurement of the broad-line reverberation lag in the Seyfert 1 galaxy Mrk 50. Combining our data with supplemental observations obtained prior to the start of the main observing campaign, our dataset covers a total duration of 4.5 months. During this time, Mrk 50 was highly variable, exhibiting a maximum variability amplitude of a factor of 4 in the U-band continuum and a factor of 2 in the H-beta line. Using standard cross-correlation techniques, we find that H-beta and H-gamma lag the V-band continuum by tau_cen = 10.64(-0.93,+0.82) and 8.43(-1.28,+1.30) days, respectively, while the lag of He II 4686 is unresolved. The H-beta line exhibits a symmetric velocity-resolved reverberation signature with shorter lags in the high-velocity wings than in the line core, consistent with an origin in a broad-line region dominated by orbital motion rather than infall or outflow. Assuming a virial normalization factor of f=5.25, the virial estimate of the black hole mass is (3.2+-0.5)*10^7 solar masses. These observations demonstrate that Mrk 50 is among the most promising nearby active galaxies for detailed investigations of broad-line region structure and dynamics.

Submitted 31 October, 2011; originally announced November 2011.

Comments: Accepted for publication in ApJ Letters. 6 pages, 4 figures

arXiv:1101.5399 [pdf, other]

doi 10.1051/0004-6361/201015939

The Globular Cluster Systems of Abell 1185

Authors: Michael J. West, Andres Jordan, John P. Blakeslee, Patrick Cote, Michael D. Gregg, Marianne Takamiya, Ronald O. Marzke

Abstract: We examine the properties of a previously discovered population of globular clusters in the heart of the rich galaxy cluster Abell 1185 that might be intergalactic in nature. Deep images obtained with the Advanced Camera for Surveys (ACS) aboard Hubble Space Telescope (HST) confirm the presence of ~ 1300 globular clusters brighter than I_{F814W} = 27.3 mag in a field devoid of any large galaxies. The luminosities and colors of these objects are found to be similar to those of metal-poor globular clusters observed in many galaxies to date. Although a significant fraction of the detected globular clusters undoubtedly reside in the outer halos of galaxies adjacent to this field, detailed modeling of their distribution suggests that the majority of these objects are likely to be intergalactic, in the sense that they are not gravitationally bound to any individual galaxy. We conclude that the true nature and origin of the globular cluster population in the core of A1185 -- galactic residents or intergalactic wanderers -- remains uncertain, and suggest how future observation could resolve this ambiguity.

Submitted 27 January, 2011; originally announced January 2011.

Comments: Accepted for publication in Astronomy and Astrophysics, 13 pages, 15 figures

arXiv:1010.3728 [pdf, ps, other]

doi 10.1111/j.1365-2966.2010.17870.x

Implications of Dramatic Broad Absorption Line Variability in the Quasar FBQS J1408+3054

Authors: Patrick B. Hall, Konstantin Anosov, R. L. White, W. N. Brandt, M. D. Gregg, R. R. Gibson, R. H. Becker, D. P. Schneider

Abstract: We have observed a dramatic change in the spectrum of the formerly heavily absorbed `overlapping-trough' iron low-ionization broad absorption line (FeLoBAL) quasar FBQS J1408+3054. Over a time span of between 0.6 to 5 rest-frame years, the Mg II trough outflowing at 12,000 km/s decreased in equivalent width by a factor of two and the Fe II troughs at the same velocity disappeared. The most likely explanation for the variability is that a structure in the BAL outflow moved out of our line of sight to the ultraviolet continuum emitting region of the quasar's accretion disk. Given the size of that region, this structure must have a transverse velocity of between 2600 km/s and 22,000 km/s. In the context of a simple outflow model, we show that this BAL structure is located between approximately 5800 and 46,000 Schwarzschild radii from the black hole. That distance corresponds to 1.7 to 14 pc, 11 to 88 times farther from the black hole than the H-beta broad-line region. The high velocities and the parsec-scale distance for at least this one FeLoBAL outflow mean that not all FeLoBAL outflows can be associated with galaxy-scale outflows in ultraluminous infrared galaxies transitioning to unobscured quasars. The change of FBQS J1408+3054 from an FeLoBAL to a LoBAL quasar also means that if (some) FeLoBAL quasars have multiwavelength properties which distinguish them from HiBAL quasars, then some LoBAL quasars will share those properties. Finally, we extend previous work on how multiple-epoch spectroscopy of BAL and non-BAL quasars can be used to constrain the average lifetime of BAL episodes (currently >60 rest-frame years at 90% confidence).

Submitted 8 December, 2010; v1 submitted 18 October, 2010; originally announced October 2010.

Comments: Final version to appear in MNRAS: references added and factor of 2 underestimate of accretion disk size corrected, resulting in absorber constrained to be somewhat closer to the black hole. For an animated gif showing the spectral evolution of the broad absorption line troughs in this quasar, see http://www.yorku.ca/phall/film19952009.gif

arXiv:1007.0028

doi 10.1088/0067-0049/189/1/83

Spectropolarimetry of Radio-Selected Broad Absorption Line Quasars

Authors: M. A. DiPompeo, M. S. Brotherton, R. H. Becker, H. D. Tran, M. D. Gregg, R. L. White, S. A. Laurent-Muehleisen

Abstract: We report spectropolarimetry of 30 radio-selected broad absorption line (BAL) quasars with the Keck Observatory, 25 from the sample of Becker et al. (2000). Both high and low-ionization BAL quasars are represented, with redshifts ranging from 0.5 to 2.5. The spectropolarimetric properties of radio-selected BAL quasars are very similar to those of radio-quiet BAL quasars: a sizeable fraction (20%) show large continuum polarization (2-10%) usually rising toward short wavelengths, emission lines are typically less polarized than the continuum, and absorption line troughs often show large polarization jumps. There are no significant correlations between polarization properties and radio properties, including those indicative of system orientation, suggesting that BAL quasars are not simply normal quasars seen from an edge-on perspective.

Submitted 30 June, 2010; originally announced July 2010.

Journal ref: Astrophysical Journal Supplement Series, 189:83-103, 2010 July

arXiv:1005.5570

doi 10.1088/0004-6256/140/2/403

The Sloan Digital Sky Survey Quasar Lens Search. IV. Statistical Lens Sample from the Fifth Data Release

Authors: Naohisa Inada, Masamune Oguri, Min-Su Shin, Issha Kayo, Michael A. Strauss, Joseph F. Hennawi, Tomoki Morokuma, Robert H. Becker, Richard L. White, Christopher S. Kochanek, Michael D. Gregg, Kuenley Chiu, David E. Johnston, Alejandro Clocchiatti, Gordon T. Richards, Donald P. Schneider, Joshua A. Frieman, Masataka Fukugita, J. Richard Gott III, Patrick B. Hall, Donald G. York, Francisco J. Castander, Neta A. Bahcall

Abstract: We present the second report of our systematic search for strongly lensed quasars from the data of the Sloan Digital Sky Survey (SDSS). From extensive follow-up observations of 136 candidate objects, we find 36 lenses in the full sample of 77,429 spectroscopically confirmed quasars in the SDSS Data Release 5. We then define a complete sample of 19 lenses, including 11 from our previous search in the SDSS Data Release 3, from the sample of 36,287 quasars with i<19.1 in the redshift range 0.6<z<2.2, where we require the lenses to have image separations of 1"<θ<20" and i-band magnitude differences between the two images smaller than 1.25 mag. Among the 19 lensed quasars, 3 have quadruple-image configurations, while the remaining 16 show double images. This lens sample constrains the cosmological constant to be Ω_Λ=0.84^{+0.06}_{-0.08}(stat.)^{+0.09}_{-0.07}(syst.) assuming a flat universe, which is in good agreement with other cosmological observations. We also report the discoveries of 7 binary quasars with separations ranging from 1.1" to 16.6", which are identified in the course of our lens survey. This study concludes the construction of our statistical lens sample in the full SDSS-I data set.

Submitted 30 May, 2010; originally announced May 2010.

Comments: 37 pages, 2 figures and 5 tables, accepted to AJ

Journal ref: Astron.J.140:403-415,2010

Showing 1–50 of 137 results for author: Gregg, D