Research of Computer Architecture and Parallel Computing
Martti Forsell, PhD ( Martti . Forsell (at) VTT . Fi )
2. Docent (Adjunct Professor), Laboratory of Computer Science and Engineering, Department of Electrical and Information Engineering, University of Oulu, Oulu, Finland
3. Past research activities: Network on Chip Group (Oulu - Stockholm) and Parallel Computing Group (Joensuu, Kuopio)
Parallel computer architecture:
° Multithreaded and multicore architectures
° Realization of strong models of computing
° Fast communication, mesh networks
° Parallel memories
Sequential computer architecture:
° Very long instruction word architectures
° Pipeline and functional unit organization
Chip multiprocessors/Multiprocessor system on chip:
° General purpose architectures
° Design methodologies (high-level language, VLSI)
° Power saving techniques
• Switch architectures
• Efficient routing algorithms
• High-bandwidth solutions
Application specific MP-SOC architectures
• Baseband processing
• RF processing
Performance, area and power modeling:
° Analytical performance, area and power models
° Functional and trace-based simulations
° Benchmark kernels
• General purpose bechmarks
Advanced compiling and optimization techniques:
° TLP compiling techniques
° ILP-TLP co-optimization
General theory of computation:
° Differences of sequential and parallel computation
° Synchronous shared memory models (e.g. PRAM)
• Hybrid models
° Parallel programming languages
° Fundamental and philosophical limits of computing
List of publications can be found here.
Now that multicore chips are evolving to many core chips implying fundamental changes both in architectures and application development. Fast, area and power efficient parallel chip multiprocessor (CMP), multiprocessor system-on-a-chip (MP-SOC) and parallel computer architectures that provide support for easy-to-program parallel software development are needed more than ever. We will give here some background to these things and describe how we take a radical new approach to these problems.
The performance of processors has increased exponentially for 40 years as a result of shrinking feature size and architectural improvements that have raised the clock frequency and the number of executed instructions per clock cycle. This has made multiprocessor designs, being hard to program and providing at most linear speedups, virtually impractical since single processor systems have been easy to program and they have catched the performance of their multiprocessor counterparts quite soon. Recently, this development has slowed down radically due to heating problems that have almost killed the growth of clock frequencies and increasing resistance of shrinking wires that effectively prevents raising the number of functional units and thus the number of executed instruction per clock cycle for general purpose applications. Processor architects have started to address these problems by decreasing the clock frequency, supply voltage, and the size of an individual processing element. Growth of performance is now seeked from parallelism by replicating processing elements and trading ILP to TLP. This will not be easy since so far the majority of general purpose computing engines, e.g. AMD Opteron, Intel Pentium 4, IBM PowerPC G5, has been designed with maximal frequency and sequential execution paradigm in mind and getting the full combined performance of multiple processing cores seems to work only for independent processes. Among the most promising technologies utilizing TLP with future silicon technology are CMPs / MP-SOCs—integrating a number of computational resources on a single chip with sufficient amount of fast storage and providing a design methodology emphasizing regularity and reuse of existing designs—and NOCs—providing a standardized high-bandwidth infrastructure to communicate among multiple computing resources on a single chip—since having a large number of fast inter-chip pins seems increasingly difficult. While parallel easy-to-use high-performance systems have been studied for a long time among the scientific community, typical CMP / MP-SOC designs borrow architectural techniques from sequential computing, rely on standard message passing libraries, like MPI, failing to provide either easy programmability or high performance, or use SMP or CC-NUMA shared memory providing good performance only in a domain well-behaving applications with relatively independent parallel components, like digital signal/media processing.
Our main contribution is the family of scalable Emulated shared memory (ESM) CMP / MP-SOC architectures to address both programmability and performance issues of CMP / MP-SOC design for a wide range of general purpose applications [6, 36, 39] (see Figure 1 on the left hand side of this page and list of publications for references). The performance of ESM CMPs / MP-SOCs has turned out to be insensitive to the degree of intercommunication . An ESM CMP / MP-SOC consists of chained multithreaded processors  with dedicated instruction memory modules , highly interleaved data memory modules  and a high-capacity sparse mesh (SM) interconnection network . The key features of CMPs / MP-SOCs are
• easy-to-program lock-step-synchronous arbitrary ordered multiprefix CRCW PRAM-style programming model providing uniform shared memory with the single step access latency and fine-grained machine instruction level synchronous parallel execution,
• software-based design methodology ultimately supporting flexibility and general purpose operation (see "Parallel application development" below) [28,34],
• homogeneous structure for simplifying design and making it easier to integrate it as a part of larger system,
• zero overhead multithreading for hiding the memory latency, balancing computation and memory access, and providing support for fine-grained TLP ,
• superpipelining for clock cycle minimization,
• hazard-free interthread pipelining with multiple functional units organized as a chain for hiding the thread switch costs and maximizing ILP within a superstep of parallel execution [3,5,26],
• totally cacheless design for avoiding cache coherency problems and simplifying the memory system ,
• a physically feasible acyclic two-dimensional sparse mesh network exploiting locality for high-bandwidth and deadlock free interresource communication,
• randomized hashing of memory words across the modules to avoid hotspots and congestion in the network,
• wave based synchronization separating references belonging to successive supersteps and allowing for different parts of the machine to execute different phases of the program depending on the progress of communication,
• an advanced explicit synchronization mechanism supporting arbitrary multiple simultaneous barrier synchronization , and
• highly interleaved memory modules for eliminating the speed difference between processors and memory banks .
We have committed recently thorough feasibility studies of some Eclipse variants indicating that they are indeed feasible on silicon [42,43]. The next step would be building an FPGA prototype of an efficicient Eclipse CMP /MP-SOC architecture.
Parallel application development
Highly productive and easy to use/migrate explicitly parallel programming paradigms, languages, and tools build on a top of efficient parallel architectures. We explain here shortly our approach to these problems.
Consider a computing model of ESM CMPs / MP-SOCs consisting of ILP and TLP components: The TLP component models a fixed number of physical threads and a uniform synchronous shared data memory with single step memory access time. Execution happens in steps, which are transparent to the user. During a step each thread of each processor executes an instruction. The second component is the chained ILP programming model dividing an instruction to a fixed number of subinstructions, which are executed in the corresponding FU in chain-like manner sequentially. This kind of technique, chaining, allows a unit to use the results of the preceding units so that a portion of strictly sequential subinstruction can be executed during a single instruction . This is not possible with models based on parallel organization of FUs. We call the obtained ability to execute multiple subinstructions during a step virtual ILP. While the second component is inefficient alone, the components can be combined with the first component by interthread superpipelining to a very efficient programming model so that the slackness of a parallel application will allow completely hazard-free superpipelined operation.
Programming happens with a high-level TLP language, like e , or with plain parallel assembler. E is an experimental TLP programming language designed for general purpose parallel computers invented by us. The syntax of the e-language is an extension of the syntax of familiar c-language. E supports parallely recursive and synchronous MIMD programming for various PRAM models, shared and private variables, arbitrary hierarchical groups of threads, and synchronous control structures. This allows a programmer to use various advanced TLP programming techniques like data parallelism, divide-and-conquer techniques, different blocking techniques, and both synchronous and asynchronous programming style to express aimed functionality efficiently.
To optimize virtual ILP component in a program compiled from the high-level TLP language one can utilize a basic block scheduling algorithm that divides each block into memory reference blocks or chained instructions and schedules the subinstructions of the basic block by trying to fill memory reference blocks, one by one, by subinstructions from the successor blocks so that ordering defined by the dependencies will not be violated . In the best case the subinstructions of a basic block are packed into a single instruction. The virtual ILP optimizer can be placed into the back-end of the compiler allowing one to optimize the code to multiple ILP configurations without applying the front end part of the compiler again.
The design flow of the application development scheme consists of eight parts (see Figure 2 on the left hand side of this page):
1. Describe computational problems as algorithms
2. Implement algorithms as programs
3. Map tasks to thread groups
4. Determine the needed hardware
5. Compile and link
6. Optimize virtual ILP
7. Implement the hardware configuration
8. Load the code to the machine
Proposals for research co-operation and positive funding decisions are welcome!
// Add a_ to elements of b_ in parallel with e
int a_; // A shared variable
int b_[size]; // A shared array of integers
int tmp_[size]; // Thread-wise copies of a_
// EREW version, O(log N) time:
// Spread a_ to tmp_ with a logarithmic algorithm
if_ ( _thread_id==0 , tmp_=a_; );
for (i=1; i<_number_of_threads; i<<=1)
if_ ( _thread_id-i>=0 ,
// CRCW version, O(1) time:
// Sort src_ in parallel with e in O(1) time
int src_[N]; // Array to be sorted
int tgt_[N]; // Rank for each element
// Multiprefix CRCW:
int i = _thread_id >> logn;
int j = _thread_id & (N-1);
if_ ( _thread_id<N2 ,
if_ ( _thread_id<N ,
Last updated July 28, 2011, MF.