![]() |
|
|||||||
![]() |
![]() |
|
||||||
|
Research of Computer Architecture and Parallel Computing
|
|
|||||||
|
|
||||||||
|
Responsible
Martti Forsell, PhD ( Martti . Forsell (at) VTT . Fi )
2. Docent (Adjunct Professor), Laboratory of Computer Science and Engineering, Department of Electrical and Information Engineering, University of Oulu, Oulu, Finland
3. Past research activities: Network on Chip Group (Oulu - Stockholm) and Parallel Computing Group (Joensuu, Kuopio)
Research interests
Parallel computer architecture:
° Multithreaded and multicore architectures
° Realization of strong models of computing
° Fast communication, mesh networks
° Parallel memories
Sequential computer architecture:
° Very long instruction word architectures
° Pipeline and functional unit organization
Chip multiprocessors/Multiprocessor system on chip:
° General purpose architectures
° Design methodologies (high-level language, VLSI)
° Power saving techniques
Network-on-chip:
• Switch architectures
• Efficient routing algorithms
• High-bandwidth solutions
Application specific MP-SOC architectures
• Baseband processing
• RF processing
Performance, area and power modeling:
° Analytical performance, area and power models
° Functional and trace-based simulations
° Benchmark kernels
• General purpose bechmarks
Advanced compiling and optimization techniques:
° TLP compiling techniques
° ILP-TLP co-optimization
General theory of computation:
° Differences of sequential and parallel computation
° Synchronous shared memory models (e.g. PRAM)
• Hybrid models
° Parallel programming languages
° Fundamental and philosophical limits of computing
Publications
|
Parallel architectures
Now that multicore chips are evolving to many core chips implying fundamental
changes both in architectures and application development. Fast, area and power
efficient parallel chip multiprocessor (CMP), multiprocessor system-on-a-chip
(MP-SOC) and parallel computer architectures that provide support for
easy-to-program parallel software development are needed more than ever. We
will give here some background to these things and describe how we take a
radical new approach to these problems.
The performance of processors has increased exponentially for 40 years as a
result of shrinking feature size and architectural improvements that have
raised the clock frequency and the number of executed instructions per clock
cycle. This has made multiprocessor designs, being hard to program and
providing at most linear speedups, virtually impractical since single processor
systems have been easy to program and they have catched the performance of
their multiprocessor counterparts quite soon. Recently, this development has
slowed down radically due to heating problems that have almost killed the
growth of clock frequencies and increasing resistance of shrinking wires that
effectively prevents raising the number of functional units and thus the number
of executed instruction per clock cycle for general purpose applications.
Processor architects have started to address these problems by decreasing the
clock frequency, supply voltage, and the size of an individual processing
element. Growth of performance is now seeked from parallelism by replicating
processing elements and trading ILP to TLP. This will not be easy since so far
the majority of general purpose computing engines, e.g. AMD Opteron, Intel
Pentium 4, IBM PowerPC G5, has been designed with maximal frequency and
sequential execution paradigm in mind and getting the full combined performance
of multiple processing cores seems to work only for independent processes.
Among the most promising technologies utilizing TLP with future silicon
technology are CMPs / MP-SOCs—integrating a number of computational resources on a single chip with sufficient
amount of fast storage and providing a design methodology emphasizing
regularity and reuse of existing designs—and NOCs—providing a standardized high-bandwidth infrastructure to communicate among
multiple computing resources on a single chip—since having a large number of fast inter-chip pins seems increasingly
difficult. While parallel easy-to-use high-performance systems have been
studied for a long time among the scientific community, typical CMP / MP-SOC
designs borrow architectural techniques from sequential computing, rely on
standard message passing libraries, like MPI, failing to provide either easy
programmability or high performance, or use SMP or CC-NUMA shared memory
providing good performance only in a domain well-behaving applications with
relatively independent parallel components, like digital signal/media
processing.
Our main contribution is the family of scalable Emulated shared memory (ESM) CMP
/ MP-SOC architectures to address both programmability and performance issues
of CMP / MP-SOC design for a wide range of general purpose applications [6, 36,
39] (see Figure 1 on the left hand side of this page and list of publications for references). The performance of ESM CMPs / MP-SOCs has turned out to be
insensitive to the degree of intercommunication [37]. An ESM CMP / MP-SOC
consists of chained multithreaded processors [3] with dedicated instruction
memory modules [4], highly interleaved data memory modules [19] and a
high-capacity sparse mesh (SM) interconnection network [33]. The key features
of CMPs / MP-SOCs are
• easy-to-program lock-step-synchronous arbitrary ordered multiprefix CRCW
PRAM-style programming model providing uniform shared memory with the single
step access latency and fine-grained machine instruction level synchronous
parallel execution,
• software-based design methodology ultimately supporting flexibility and general
purpose operation (see "Parallel application development" below) [28,34],
• homogeneous structure for simplifying design and making it easier to integrate
it as a part of larger system,
• zero overhead multithreading for hiding the memory latency, balancing
computation and memory access, and providing support for fine-grained TLP [26],
• superpipelining for clock cycle minimization,
• hazard-free interthread pipelining with multiple functional units organized as
a chain for hiding the thread switch costs and maximizing ILP within a superstep of
parallel execution [3,5,26],
• totally cacheless design for avoiding cache coherency problems and simplifying
the memory system [19],
• a physically feasible acyclic two-dimensional sparse mesh network exploiting
locality for high-bandwidth and deadlock free interresource communication,
• randomized hashing of memory words across the modules to avoid hotspots and
congestion in the network,
• wave based synchronization separating references belonging to successive
supersteps and allowing for different parts of the machine to execute different
phases of the program depending on the progress of communication,
• an advanced explicit synchronization mechanism supporting arbitrary multiple
simultaneous barrier synchronization [30], and
• highly interleaved memory modules for eliminating the speed difference between
processors and memory banks [19].
We have committed recently thorough feasibility studies of some Eclipse variants
indicating that they are indeed feasible on silicon [42,43]. The next step would
be building an FPGA prototype of an efficicient Eclipse CMP /MP-SOC
architecture.
Parallel application development
Highly productive and easy to use/migrate explicitly parallel programming
paradigms, languages, and tools build on a top of efficient parallel
architectures. We explain here shortly our approach to these problems.
Consider a computing model of ESM CMPs / MP-SOCs consisting of ILP and TLP
components: The TLP component models a fixed number of physical threads and a
uniform synchronous shared data memory with single step memory access time.
Execution happens in steps, which are transparent to the user. During a step
each thread of each processor executes an instruction. The second component is
the chained ILP programming model dividing an instruction to a fixed number of
subinstructions, which are executed in the corresponding FU in chain-like
manner sequentially. This kind of technique, chaining, allows a unit to use the
results of the preceding units so that a portion of strictly sequential
subinstruction can be executed during a single instruction [3]. This is not
possible with models based on parallel organization of FUs. We call the
obtained ability to execute multiple subinstructions during a step virtual ILP.
While the second component is inefficient alone, the components can be combined
with the first component by interthread superpipelining to a very efficient
programming model so that the slackness of a parallel application will allow
completely hazard-free superpipelined operation.
Programming happens with a high-level TLP language, like e [8], or with plain
parallel assembler. E is an experimental TLP programming language designed for
general purpose parallel computers invented by us. The syntax of the e-language
is an extension of the syntax of familiar c-language. E supports parallely
recursive and synchronous MIMD programming for various PRAM models, shared and
private variables, arbitrary hierarchical groups of threads, and synchronous
control structures. This allows a programmer to use various advanced TLP
programming techniques like data parallelism, divide-and-conquer techniques,
different blocking techniques, and both synchronous and asynchronous
programming style to express aimed functionality efficiently.
To optimize virtual ILP component in a program compiled from the high-level TLP
language one can utilize a basic block scheduling algorithm that divides each
block into memory reference blocks or chained instructions and schedules the
subinstructions of the basic block by trying to fill memory reference blocks,
one by one, by subinstructions from the successor blocks so that ordering
defined by the dependencies will not be violated [26]. In the best case the
subinstructions of a basic block are packed into a single instruction. The
virtual ILP optimizer can be placed into the back-end of the compiler allowing
one to optimize the code to multiple ILP configurations without applying the
front end part of the compiler again.
The design flow of the application development scheme consists of eight parts
(see Figure 2 on the left hand side of this page):
1. Describe computational problems as algorithms
2. Implement algorithms as programs
3. Map tasks to thread groups
4. Determine the needed hardware
5. Compile and link
6. Optimize virtual ILP
7. Implement the hardware configuration
8. Load the code to the machine
Proposals for research co-operation and positive funding decisions are welcome!
|
|
||||||
|
|
||||||||
![]() |
|
|||||||
|
|
||||||||
|
// Add a_ to elements of b_ in parallel with e
int a_; // A shared variable
int b_[size]; // A shared array of integers
int tmp_[size]; // Thread-wise copies of a_
// EREW version, O(log N) time:
int i;
// Spread a_ to tmp_ with a logarithmic algorithm
if_ ( _thread_id==0 , tmp_[0]=a_; );
for (i=1; i<_number_of_threads; i<<=1)
if_ ( _thread_id-i>=0 ,
tmp_[_thread_id]=tmp_[_thread_id-i]; );
b_[_thread_id]+=tmp_[_thread_id]
// CRCW version, O(1) time:
b_[_thread_id]+=a_;
// Sort src_ in parallel with e in O(1) time
int src_[N]; // Array to be sorted
int tgt_[N]; // Rank for each element
// Multiprefix CRCW:
int i = _thread_id >> logn;
int j = _thread_id & (N-1);
if_ ( _thread_id<N2 ,
fast_multi(MADD,&tgt_[j],src_[i]<src_[j]););
if_ ( _thread_id<N ,
src_[-target_[_thread_id]]=src_[_thread_id]; );
|
|
|||||||
|
|
||||||||
![]() |
|
|||||||
|
|
||||||||
|
|
|
|||||||
|
|
||||||||
|
Last updated July 28, 2011, MF.
|
|
|||||||
|
|
|
|
|
|
|
|
|
|
