-
Classifying Cycling Hazards in Egocentric Data
Authors:
Jayson Haebich,
Christian Sandor,
Alvaro Cassinelli
Abstract:
This proposal is for the creation and annotation of an egocentric video data set of hazardous cycling situations. The resulting data set will facilitate projects to improve the safety and experience of cyclists. Since cyclists are highly sensitive to road surface conditions and hazards they require more detail about road conditions when navigating their route. Features such as tram tracks, cobbles…
▽ More
This proposal is for the creation and annotation of an egocentric video data set of hazardous cycling situations. The resulting data set will facilitate projects to improve the safety and experience of cyclists. Since cyclists are highly sensitive to road surface conditions and hazards they require more detail about road conditions when navigating their route. Features such as tram tracks, cobblestones, gratings, and utility access points can pose hazards or uncomfortable riding conditions for their journeys. Possible uses for the data set are identifying existing hazards in cycling infrastructure for municipal authorities, real time hazard and surface condition warnings for cyclists, and the identification of conditions that cause cyclists to make sudden changes in their immediate route.
△ Less
Submitted 14 March, 2021;
originally announced March 2021.
-
Exploring performance and power properties of modern multicore chips via simple machine models
Authors:
Georg Hager,
Jan Treibig,
Johannes Habich,
Gerhard Wellein
Abstract:
Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model t…
▽ More
Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single- and multi-core performance of streaming kernels. The model refines the well-known roofline model, since it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multicore chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different requirements to the hardware, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multicore processors, and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice-Boltzmann flow solver code.
△ Less
Submitted 19 March, 2014; v1 submitted 14 August, 2012;
originally announced August 2012.
-
Performance engineering for the Lattice Boltzmann method on GPGPUs: Architectural requirements and performance results
Authors:
Johannes Habich,
Christian Feichtinger,
Harald Köstler,
Georg Hager,
Gerhard Wellein
Abstract:
GPUs offer several times the floating point performance and memory bandwidth of current standard two socket CPU servers, e.g. NVIDIA C2070 vs. Intel Xeon Westmere X5650. The lattice Boltzmann method has been established as a flow solver in recent years and was one of the first flow solvers to be successfully ported and that performs well on GPUs. We demonstrate advanced optimization strategies for…
▽ More
GPUs offer several times the floating point performance and memory bandwidth of current standard two socket CPU servers, e.g. NVIDIA C2070 vs. Intel Xeon Westmere X5650. The lattice Boltzmann method has been established as a flow solver in recent years and was one of the first flow solvers to be successfully ported and that performs well on GPUs. We demonstrate advanced optimization strategies for a D3Q19 lattice Boltzmann based incompressible flow solver for GPGPUs and CPUs based on NVIDIA CUDA and OpenCL. Since the implemented algorithm is limited by memory bandwidth, we concentrate on improving memory access. Basic data layout issues for optimal data access are explained and discussed. Furthermore, the algorithmic steps are rearranged to improve scattered access of the GPU memory. The importance of occupancy is discussed as well as optimization strategies to improve overall concurrency. We arrive at a well-optimized GPU kernel, which is integrated into a larger framework that can handle single phase fluid flow simulations as well as particle-laden flows. Our 3D LBM GPU implementation reaches up to 650 MLUPS in single precision and 290 MLUPS in double precision on an NVIDIA Tesla C2070.
△ Less
Submitted 5 December, 2011;
originally announced December 2011.
-
A Flexible Patch-Based Lattice Boltzmann Parallelization Approach for Heterogeneous GPU-CPU Clusters
Authors:
Christian Feichtinger,
Johannes Habich,
Harald Koestler,
Georg Hager,
Ulrich Ruede,
Gerhard Wellein
Abstract:
Sustaining a large fraction of single GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. In this article, this topic is addressed in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. We propose a multi-GPU implementation using a block-structured MPI parallelization, suitable for load balancing an…
▽ More
Sustaining a large fraction of single GPU performance in parallel computations is considered to be the major problem of GPU-based clusters. In this article, this topic is addressed in the context of a lattice Boltzmann flow solver that is integrated in the WaLBerla software framework. We propose a multi-GPU implementation using a block-structured MPI parallelization, suitable for load balancing and heterogeneous computations on CPUs and GPUs. The overhead required for multi-GPU simulations is discussed in detail and it is demonstrated that the kernel performance can be sustained to a large extent. With our GPU implementation, we achieve nearly perfect weak scalability on InfiniBand clusters. However, in strong scaling scenarios multi-GPUs make less efficient use of the hardware than IBM BG/P and x86 clusters. Hence, a cost analysis must determine the best course of action for a particular simulation task. Additionally, weak scaling results of heterogeneous simulations conducted on CPUs and GPUs simultaneously are presented using clusters equipped with varying node configurations.
△ Less
Submitted 8 July, 2010;
originally announced July 2010.