Code Enhancements for Vectorizing Compilers by Virtual Performance Evaluation

How well applications exploit advanced hardware features in order to run faster depends in part on the methods and techniques by which a compiler generates machine code for Central Processing Units (CPUs), which play a crucial role in computer systems. The performance of current CPUs has increased considerably, and requiring enterprise programmers to focus on hardware and the lower layers of software incurs extra cost. Currently, a large gap between applications and hardware wastes money: because of unused processing power, extra programming effort is needed to exploit hardware capabilities properly. To reduce these costs, the compiler must generate code that exploits the available hardware features without requiring extra programming approaches in the application development domain. Furthermore, in applications such as multimedia applications, whose performance is limited by the speed of computations on multimedia data, it is very important to use high-performance hardware and software. In addition, modern CPUs are equipped with multimedia extensions that perform these computations using vector processing units. Mapping algorithms onto vector-like operations, that is, vectorizing multimedia source code, has therefore become an important field of research in both theoretical and practical computer science.

Our goal in this project is to design and implement a new vectorization strategy based on the features and requirements of multimedia applications. On the one hand, there are many optimization techniques for generating proper code; on the other hand, compilers must know the processor micro-architecture to make better use of its processing power.
To accomplish these tasks, the source code must pass through several compilation steps in order to gain more performance. Because we consider the current GPP development path an unsuitable approach, a new x86-compatible processor will be designed. Additionally, special attention will be given to exploiting large amounts of Data-Level Parallelism (DLP) and Thread-Level Parallelism (TLP) in applications. In multimedia applications, many computational results can be reused in the next algorithm step; reusing these results in functional units and registers requires designing new instructions to gain more performance.
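As a minimal illustration of the two forms of parallelism mentioned above, consider a SAXPY-style loop sketched two ways. The function names and the OpenMP-based threading are our own illustrative assumptions, not part of the proposed processor design:

```c
#include <stddef.h>

/* DLP: every iteration is independent, so a vectorizing compiler can map
   this loop onto SIMD instructions, processing several floats at once. */
void saxpy_dlp(float *y, const float *x, float a, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* TLP: the same independent iterations can instead be split across
   threads; the OpenMP pragma distributes iterations among cores
   (it is ignored as a no-op when OpenMP is not enabled). */
void saxpy_tlp(float *y, const float *x, float a, size_t n) {
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        y[i] = a * x[i] + y[i];
}
```

Both functions compute the same result; DLP and TLP differ only in whether the independent iterations are executed by vector lanes or by threads, and the two can be combined.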

Performance Evaluation of Implicit and Explicit SIMDization

Abstract – Processor vendors have been expanding Single Instruction Multiple Data (SIMD) extensions to exploit data-level parallelism in their General Purpose Processors (GPPs). Each SIMD technology, such as Streaming SIMD Extensions (SSE) and Advanced Vector eXtensions (AVX), has its own Instruction Set Architecture (ISA), which is equipped with Special Purpose Instructions (SPIs). Many programming approaches have been developed to exploit these features. The Intrinsic Programming Model (IPM) is a low-level approach for explicit SIMDization. Besides, Compilers' Automatic Vectorization (CAV) has been embedded in modern compilers such as the Intel C++ Compiler (ICC), the GNU Compiler Collection (GCC), and LLVM for implicit vectorization. Each SIMDization approach shows different improvements because of differing SIMD ISAs, vector register widths, and programming models. Our goal in this paper is to evaluate the performance of explicit and implicit vectorization. Our experimental results show that the behavior of explicit vectorization across different compilers is almost the same compared to implicit vectorization. IPM improves performance more than CAVs. In general, the ICC and GCC compilers vectorize kernels and use SPIs more efficiently than LLVM. In addition, AVX2 technology is more useful for small matrices and compute-intensive kernels than for large matrices and data-intensive kernels, because of memory bottlenecks. Furthermore, CAVs fail to vectorize kernels that have overlapping or non-consecutive memory access patterns. The way a kernel is implemented also impacts its vectorization. In order to understand which scalar implementations of an algorithm are suitable for vectorization, an approach based on a code-modification technique is proposed.
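The contrast between implicit and explicit SIMDization can be sketched with a vector addition kernel. The scalar version relies on the compiler's auto-vectorizer (CAV); the second version uses SSE intrinsics in the IPM style. The function names are ours, but the `_mm_*` intrinsics are the standard SSE API from `<immintrin.h>`:

```c
#include <immintrin.h>  /* SSE intrinsics */
#include <stddef.h>

/* Implicit SIMDization (CAV): plain scalar code; at -O3 a vectorizing
   compiler (ICC, GCC, LLVM) may emit SSE/AVX instructions itself. */
void add_scalar(float *c, const float *a, const float *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Explicit SIMDization (IPM): each _mm_* call maps to one SSE
   instruction on a 128-bit register, i.e. four packed floats at once. */
void add_sse(float *c, const float *a, const float *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* load 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)                     /* scalar remainder loop */
        c[i] = a[i] + b[i];
}
```

With IPM the vector width and instruction selection are fixed in the source, whereas with CAV they depend on the compiler and its flags, which is exactly the variability the paper measures.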
Our experimental results show that scalar implementations that apply loop collapsing, loop unrolling, software pipelining, or loop interchange can be vectorized more efficiently than straightforward implementations.
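As an illustration of one such transformation, loop collapsing, consider scaling a 2D image two ways. The kernel and its dimensions are our own illustrative choices; the effect on the auto-vectorizer depends on the compiler:

```c
#include <stddef.h>

#define ROWS 64
#define COLS 64

/* Straightforward nested implementation: the short inner trip count and
   per-row loop bookkeeping can hinder auto-vectorization. */
void scale2d_nested(float img[ROWS][COLS], float s) {
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            img[r][c] *= s;
}

/* Loop collapsing: the two loops are fused into one linear pass over the
   contiguous array, giving the vectorizer a single long trip count. */
void scale2d_collapsed(float img[ROWS][COLS], float s) {
    float *p = &img[0][0];
    for (size_t i = 0; i < (size_t)ROWS * COLS; i++)
        p[i] *= s;
}
```

Both functions produce identical results; only the loop structure offered to the vectorizer differs.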
Published in ScienceDirect.