Below you will find the available abstracts for these talks. Besides the subject of each talk, the research group itself is also presented.
The abstracts are listed in random order, together with the university at which each talk was given:
- Accelerating sequential computer vision algorithms using commodity parallel hardware
– NHL Leeuwarden – Jaap van de Loosdrecht
- A General Toolkit for “GPUtilisation” in SME Applications
– University of Groningen
- Modularity Graph Clustering on the GPU
– University of Utrecht
- OpenCL against the world: a battle for performance
– Delft University of Technology – Ana Lucia Varbanescu
- The ‘Bones’ Source-to-Source Compiler: Making Parallel Programming Easy
– Eindhoven University of Technology
- Mapping Streaming Applications on Heterogeneous Platforms with Data Parallel Accelerators
– LIACS, Leiden – Ana Balevic
- Correct and Efficient Accelerator Programming
– University of Twente – Marieke Huisman
- GPGPU in Radio Astronomy
– Netherlands eScience center / Astron / VU Amsterdam
- GPGPU in bioinformatics: speeding up sequence alignment analysis
– Hanze University Groningen / Wageningen University
- Sparse octrees and GPGPU
- GPU computing for tomography
– CWI, Amsterdam / University of Antwerp, Belgium
- Diffusion-weighted magnetic resonance imaging tractography
– Maastricht University
- The unique challenges of producing compilers for GPUs
– Codeplay, UK – Andrew Richards
- A tour of LLVM, and why it is important for GPU software
– Codeplay, UK – Paul Keir
- DEGIMA, The greenest accelerator-based supercomputer in the World
– Nagasaki, JP – Tsuyoshi Hamada
Accelerating sequential computer vision algorithms using commodity parallel hardware
Jaap van de Loosdrecht
NHL Centre of Expertise in Computer Vision
This research project is a collaboration between the NHL Centre of Expertise in Computer Vision, the Limerick Institute of Technology and Van de Loosdrecht Machine Vision BV (VdLMV). The primary objective of the project is to develop knowledge and experience in the field of multi-core CPU and GPU programming, in order to effectively accelerate a huge base of legacy sequential computer vision algorithms, and to use that knowledge and experience to develop new algorithms. The project focuses on both multi-core CPUs and GPUs; the tools used are OpenMP and OpenCL.
VisionLab is a software package under development by VdLMV. It is a portable C++ library and a development environment for machine vision applications, pattern recognition and classification with neural networks. VisionLab runs on Windows, Linux and Android platforms with x86, x64, ARM or PowerPC architectures. More than 160 operators of the library have been parallelized using OpenMP for multi-core CPUs, and a toolbox is available for using OpenCL kernels from both the script language and C++.
This presentation will focus on using the toolbox for developing, testing and benchmarking OpenCL kernels. Writing the host-side code in C(++) is normally labour-intensive and error-prone. The script language of VisionLab has therefore been extended with commands that call the OpenCL host API. These extensions greatly reduce the time spent writing host-side code and make it easy to test kernels and benchmark them against the sequential and OpenMP versions of the operators.
A General Toolkit for “GPUtilisation” in SME Applications
David Williams and Vali Codreanu
RuG, Scientific Visualization and Computer Graphics
The recent explosion of GPU (Graphical Processing Unit) power has not been fully utilised by many SMEs (small and medium-sized enterprises), possibly because GPU programming requires specialist skills different from those of conventional programming. GPSME will provide the SME participants with a simple route to accessing GPU power. Each of the SME participants faces increasing competition in its market, and GPSME will allow them to greatly improve their products in terms of speed and quality without major overheads. Through close cooperation between the SMEs and RTD performers, GPSME will develop a toolkit that automates the conversion of existing sequential CPU code to an optimal GPU implementation. With such a toolkit, the SMEs will be able to convert their existing CPU code without committing significant effort and time. The toolkit will also allow advanced techniques to run within acceptable runtimes, and hence allow the SMEs to use more complex computing models in their new products. This will bring them major commercial benefits and significantly improve their market positions.
Modularity Graph Clustering on the GPU
Bas Fagginger Auer (joint work with prof. dr. Rob Bisseling)
University of Utrecht
Clustering pertains to dividing the vertices of a given graph (e.g. a social network, where vertices represent people and edges pairs of friends) into meaningful subgroups, called clusters.
Here, meaningful means having a lot of connections within each cluster, but relatively few connections between different clusters.
Finding a good clustering of a graph’s vertices has applications in a variety of fields, for example to find clusters of related proteins in PPI-networks in bioinformatics.
For this talk, the quality of a clustering will be measured by calculating its modularity (introduced by Newman and Girvan in 2004).
Maximising modularity is NP-hard, so we use agglomerative clustering as an effective greedy heuristic to generate graph clusterings of high modularity in a small amount of time.
To achieve good performance, we have developed a parallel algorithm which uses the processing power offered by multi-core CPU and GPU hardware to solve the clustering problem.
This heuristic is able to generate clusterings in very little time: a modularity 0.996 clustering is obtained from a street network graph with 14 million vertices and 17 million edges in 4.6 seconds on the GPU.
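The quality measure used above can be made concrete with a small sketch (my illustration, not the authors' code): Newman-Girvan modularity sums, over all clusters, the fraction of edges internal to the cluster minus the fraction expected if edges were placed at random according to vertex degrees.

```python
# Illustrative sketch (not the authors' code): computing Newman-Girvan
# modularity of a given clustering of a small undirected graph.

def modularity(edges, cluster):
    """edges: list of (u, v) pairs; cluster: dict vertex -> cluster id."""
    m = len(edges)
    internal = {}    # number of edges with both endpoints in the cluster
    degree_sum = {}  # sum of vertex degrees per cluster
    for u, v in edges:
        cu, cv = cluster[u], cluster[v]
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
    return sum(internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by a single edge: the natural two-cluster split
# scores a clearly positive modularity.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
clustering = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
print(modularity(edges, clustering))
```

The agglomerative heuristic repeatedly merges the pair of clusters whose merge increases this score the most; the GPU implementation evaluates many candidate merges in parallel.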
OpenCL against the world: a battle for performance
dr. Ana Lucia Varbanescu
Delft University of Technology
Rumor has it that OpenCL lacks performance portability. Some skeptics even say OpenCL has lost the performance battle altogether, on both CPUs and GPUs. In our work, we have challenged both these ideas by devising a systematic and very thorough performance comparison of OpenCL with both OpenMP and CUDA, thus addressing both its CPU and GPU behavior. This presentation summarizes our most interesting findings, grouped in three important categories: OpenCL's performance, OpenCL's performance portability, and the (un)fairness of programming-model comparisons. Our ultimate conclusion is that OpenCL is competitive, performance-wise, with the rest of the programming-models world, but that this can only come at the expense of its performance portability. Whether this impasse can and/or should be solved by the standard, by the programmers, or by the compilers/runtimes remains an interesting open question for the community.
The ‘Bones’ Source-to-Source Compiler: Making Parallel Programming Easy
Eindhoven University of Technology
Recent advances in multi-core and many-core processors require programmers to exploit an increasing amount of parallelism in their applications. Data-parallel languages such as CUDA and OpenCL make it possible to take advantage of such processors, but still require a large amount of effort from programmers.
A number of parallelizing source-to-source compilers have recently been developed to ease the programming of multi-core and many-core processors. In this presentation we will look at several of these tools, focusing in particular on C-to-CUDA transformations targeting GPUs. The tools will be compared to each other and their strengths and weaknesses identified.
We then introduce a new tool, the source-to-source compiler 'Bones', based on the algorithmic-skeletons technique. The compiler generates parallel code (CUDA, OpenCL, OpenMP) from skeletons of parallel structures, which can be seen as parameterisable library implementations for a set of algorithm classes. Bones requires few modifications to the original sequential source code, generates readable code for further fine-tuning, and delivers superior performance compared to other tools on a set of 8 image processing kernels.
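The algorithmic-skeletons idea can be illustrated with a tiny sketch (my illustration, not Bones itself): a skeleton fixes the parallel structure of a whole algorithm class, and the user plugs in only the element-wise computation.

```python
# Illustrative sketch (not Bones itself): a "map" algorithmic skeleton.
# The skeleton fixes the parallel structure; the user supplies only the
# element-wise kernel, just as Bones matches a sequential loop to a
# parameterisable parallel template and fills in the target-specific code.
from multiprocessing.dummy import Pool  # stdlib thread pool

def map_skeleton(kernel, data, workers=4):
    """Apply 'kernel' independently to every element of 'data'."""
    with Pool(workers) as pool:
        return pool.map(kernel, data)

# Example "kernel": an image-processing style per-pixel operation.
def threshold(pixel, level=128):
    return 255 if pixel >= level else 0

pixels = [12, 200, 128, 90, 255]
print(map_skeleton(threshold, pixels))
```

A source-to-source compiler generates the equivalent of `map_skeleton` in CUDA, OpenCL or OpenMP, so the programmer only ever writes the sequential `threshold`-style body.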
Mapping Streaming Applications on Heterogeneous Platforms with Data Parallel Accelerators
Impressive computational speedups have been reported for GPU acceleration of numerous algorithms in fields such as medical image processing, digital signal processing, astrophysics, modeling and simulation. With the increasing computational power required for streaming data processing in the embedded and HPC domains, the trend is shifting towards heterogeneous platforms composed of a mix of multi-core processors, special function units and accelerators such as FPGAs and GPUs. While parallelizing computational kernels for a multi-core CPU or a GPU is already a specialist task requiring a high degree of expertise and skill, mapping an application onto a heterogeneous platform presents even greater challenges. In this talk, we review the state of the art in automatic parallelization techniques, and present a novel model-based approach that assists in mapping streaming applications onto heterogeneous platforms with GPUs and enables the exploitation of task, data and pipeline parallelism.
Correct and Efficient Accelerator Programming
dr. Marieke Huisman
University of Twente
The importance of massively parallel accelerator processors has been widely recognized by the ICT community at large. Accelerator processors, primarily GPUs, offer huge computing power at a relatively low cost. In tasks such as media processing, simulation, medical imaging and eye-tracking, GPUs can beat CPU performance by orders of magnitude. Accelerator processors are also increasingly used in scientific computing, especially in numerical linear algebra. Despite all these advantages, accelerator programming is still low-level. To raise accelerator programming to a higher level, applications must exhibit portable correctness, operating correctly on any configuration of accelerators, and portable performance, exploiting the processing power and energy efficiency offered by a wide range of devices. The FMT group participates in the CARP project, whose aim is to design techniques and tools for high-level, correct and efficient accelerator programming.
In this talk we will present the main goals and objectives of the CARP project and introduce its participants. FMT's role in the project focuses on the semantics and verification of accelerator programs. Within the project, a high-level accelerator programming language, Pencil, will be developed; we will make sure this language has a well-defined semantics. In addition, we will develop verification techniques for Pencil programs, building on our experience with the verification of concurrent Java programs. The talk will give a brief outline of how we do this and how we plan to adapt it to Pencil programs.
GPGPU in Radio Astronomy
dr. Rob van Nieuwpoort
Netherlands eScience center / Astron / VU Amsterdam
LOFAR is the largest radio telescope in the world, built by Astron in the Netherlands. The communication and computational challenges faced by this instrument are enormous: we need hundreds of teraflops of computational power and hundreds of gigabits of I/O connectivity. Currently, Astron uses a 2.5-rack IBM Blue Gene/P supercomputer in Groningen for the real-time processing, and separate clusters for the different astronomy science cases. This presentation discusses the use of GPU technology in LOFAR. We have developed codes that perform many signal processing operations on GPUs. These operations are very data-intensive, however, so mapping them to GPUs is difficult. Nevertheless, the performance gains, and especially the increased energy efficiency, make it an attractive approach.
GPGPU in bioinformatics: speeding up sequence alignment analysis
Institute for Life Science & Technology, Hanze University of Applied Sciences Groningen
Currently available next-generation sequencing platforms produce millions of short DNA sequences, from 30 bases up to several hundred bases. These are analyzed for genomic variances, regulatory elements or whole genome (re)sequencing. The resulting data sets are up to terabytes in size. The availability of fast, flexible and highly accurate sequence alignment software is important for analyzing such sequence data. Currently available software applications are either not fast enough, too specialized, or not sufficiently accurate due to statistical assumptions. The Parallel Smith-Waterman Alignment Software (PaSWAS) gives easy access to the computational power of general-purpose graphics processing units (GPGPUs) to perform high-speed sequence alignments with high accuracy. The implementation combines the parallel nature of the Smith-Waterman algorithm with the ability of a GPGPU to perform many sequence alignments in parallel. PaSWAS minimizes data transfer by copying only the sequences to the GPU and transferring the relevant information directly to CPU RAM after the calculations. For flexibility and user-friendliness, a Python application is being developed using pyCUDA, bioPython and other standard libraries. This way PaSWAS can easily be integrated into existing analysis pipelines and applications.
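The algorithm being accelerated can be sketched in a few lines (my illustration, not PaSWAS itself; the match, mismatch and gap scores are hypothetical example parameters). Cells on the same anti-diagonal of the scoring matrix are independent, which, together with aligning many sequence pairs at once, is what the GPU exploits.

```python
# Illustrative sketch (not PaSWAS itself): the scoring recurrence of the
# Smith-Waterman local-alignment algorithm. The scoring parameters below
# are hypothetical examples.

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local-alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # A cell never goes negative: 0 restarts the local alignment.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("GATTACA", "GCATGCT"))
```

In the full algorithm the position of the best-scoring cell is also recorded and the alignment is recovered by tracing back through the matrix; the sketch keeps only the score.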
Sheets not available.
Sparse octrees and GPGPU
The octree is one of the fundamental data structures used to partition a three-dimensional space by recursively subdividing it into eight octants. A sparse octree is an octree with most of its octants empty; it is typically generated from data sets with a large degree of clustering. The applications of sparse octrees range from range-search queries to science-grade Poisson solvers on unstructured point sets. However, the natural recursive algorithms, and their iterative implementations, which have been used successfully on single- and multi-threaded processors, are generally not capable of delivering satisfactory performance on massively parallel GPGPU systems. In this contribution I will describe how one can extract a high degree of parallelism from sparse octrees in order to leverage the massive horsepower of modern GPGPU processors.
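One standard way to expose this parallelism (my illustration, not necessarily the speaker's method) is to linearise the tree with Morton, or Z-order, keys: sorting points by interleaved coordinate bits groups them by octant at every level, so tree construction becomes a data-parallel sort.

```python
# Illustrative sketch: encoding 3-D integer coordinates as Morton
# (Z-order) keys. Sorting points by key groups them by octant at every
# level of the octree, turning recursive construction into a
# data-parallel sort -- a common trick for building sparse octrees on GPUs.

def morton3d(x, y, z, bits=10):
    """Interleave the low 'bits' bits of x, y and z into one key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

points = [(5, 1, 7), (0, 0, 0), (5, 1, 6), (7, 7, 7)]
ordered = sorted(points, key=lambda p: morton3d(*p))
# Spatially nearby points end up adjacent in the sorted order, so each
# octant is a contiguous range of the array that one thread block can own.
print(ordered)
```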
GPU computing for tomography
Prof. Dr. Joost Batenburg
Centrum Wiskunde & Informatica, Amsterdam / Vision Lab, University of Antwerp, Belgium
Tomography is concerned with the reconstruction of images from their projections. A prominent example
of this technique can be found in medical CT scanners, which are capable of creating high-resolution
images of a patient's internal organs. One of the key areas of my research is the development of
theory and reconstruction algorithms for a wide range of applications in tomography,
both medical and non-medical. Availability of massive computation power is crucial to the applicability of
these algorithms, as they typically take days to run on a desktop PC.
The highly parallel architecture of modern GPUs is ideally suited for tomography
computations. We have developed GPU implementations of the core tomography operations, resulting in a
40-fold speedup compared to a high-performance single-core CPU implementation. When the power of several
GPUs is combined in a single system, a single PC can even outperform a large supercomputer cluster
for this type of algorithm, bringing supercomputer performance into the office.
In this talk, the basic concepts of tomography and its computational requirements will be introduced, followed
by an outline of how tomography algorithms can be mapped onto the GPU architecture, and how they perform.
I will conclude with an overview of our work on desktop supercomputing.
The FASTRA II Desktop Supercomputer, http://fastra2.ua.ac.be
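The kind of core operation mentioned above can be illustrated with a small sketch (mine, not the authors' code): unfiltered back-projection smears each detector reading back along its ray. Every output pixel is computed independently of the others, which is exactly why the operation suits the GPU's massively parallel architecture.

```python
# Illustrative sketch (not the authors' code): unfiltered back-projection
# for a parallel-beam geometry. Each output pixel is independent, so on a
# GPU one thread can own one pixel.
import math

def backproject(sinogram, angles, n):
    """sinogram[k][d]: detector readings at angle angles[k]; n x n output."""
    image = [[0.0] * n for _ in range(n)]
    centre = (n - 1) / 2.0
    for k, theta in enumerate(angles):
        c, s = math.cos(theta), math.sin(theta)
        for y in range(n):
            for x in range(n):
                # Project pixel (x, y) onto the detector for this angle.
                t = (x - centre) * c + (y - centre) * s + centre
                d = int(round(t))
                if 0 <= d < len(sinogram[k]):
                    image[y][x] += sinogram[k][d] / len(angles)
    return image

# Two projections of a single bright point, smeared back over a 5x5 grid:
# the reconstruction peaks where the two rays intersect.
angles = [0.0, math.pi / 2]
sino = [[0, 0, 1, 0, 0], [0, 0, 1, 0, 0]]
img = backproject(sino, angles, 5)
```

Real reconstruction algorithms filter the projections first and iterate between forward projection and back-projection; the sketch shows only the memory-access pattern that dominates the GPU mapping.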
Diffusion-weighted magnetic resonance imaging tractography
Maastricht University – Cognitive Neuroscience/Neuroimaging
In my presentation I will first give a brief, general introduction to diffusion MRI data processing and to reconstructing the whole-brain map of cortico-cortical white matter connections, known as the connectome. Several approaches are available to reconstruct such a map. I will focus on a recently developed technique for whole-brain tractography which uses graph analysis to reconstruct the white matter pathways in the human brain. This approach is slower than standard deterministic algorithms, since it has to compute, for each seed voxel, the map of edge weights for the whole brain volume. In the talk, I will show how I improved the performance of this algorithm using CUDA.
The unique challenges of producing compilers for GPUs
CPU compilers have been around since the 1950s; GPU compilers are much more recent. This talk will cover the unique challenges of writing GPU compilers. Because CPU architectures and languages have been around much longer than GPUs, CPU compilers could mature over decades, whereas GPU compilers must be developed far more quickly. This speed of development is a particular problem. GPUs also derive their benefit from massive levels of parallelism, which means their cores are designed for running large numbers of threads relatively slowly. This creates quite different design constraints compared with CPUs. Andrew is the CEO of Codeplay, which has been writing GPU compilers for a variety of customers for over a decade.
A tour of LLVM, and why it is important for GPU software
Modern compiler technology such as LLVM and Clang has an essential role to play
in future parallel language designs. By providing a common intermediate representation,
LLVM allows languages targeting new architectures to benefit from existing compiler
optimisation passes. Single-source GPGPU languages such as CUDA, OpenMP, C++ AMP,
and OpenACC also make essential use of compiler technology to partition source into
host and accelerator code. The open-source LLVM and Clang projects are written in modern C++;
sponsored by Apple; and used within an increasing number of GPGPU language projects and
products from both industry and academia. In this talk I will give an overview of
LLVM relevant to GPGPU language design.
DEGIMA, The greenest accelerator-based supercomputer in the World
Nagasaki University, JP
At the frontline of high-performance computing, we have seen great success in producing faster, cheaper and smaller systems. One good example of such progress is the GPU-based DEGIMA cluster, which distinguishes itself as one of the most cost-efficient and power-efficient supercomputers in the world. However, the same advances that have driven HPC have intensified power-related challenges. In this talk we will discuss the energy management techniques used in DEGIMA that allow us to maximise performance for a given power budget, or to improve power efficiency for a given performance target.