In the section Sequential Y86-64 Implementations, we stepped through a complete design for a Y86-64 processor. This style of implementation does not make very good use of our hardware units, however, since each unit is active for only a fraction of the total clock cycle. We will see that we can achieve much better performance by introducing pipelining.
Before attempting to design a pipelined Y86-64 processor, let us consider some general properties and principles of pipelined systems.
A key feature of pipelining is that it increases the throughput of the system (the number of instructions completed per unit of time), but it may also slightly increase the latency (the time required to complete a single instruction).
Let us look in some detail at the timing and operation of pipeline computations.
First, consider the following unpipelined computation hardware.
We could divide the computation performed into three stages, A, B, C:
The following picture traces the circuit activity between times 240 and 360, as instruction I1 propagates through stage C, I2 propagates through stage B, and I3 propagates through stage A. (I1 is shown in dark gray, I2 in blue, and I3 in light gray.)
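The stage occupancy described above can be reproduced with a short Python sketch. It assumes a 120 ps clock (inferred from the 240 to 360 interval covering exactly one cycle) and that instruction Ik enters stage A in cycle k-1; the names `STAGES`, `CLOCK_PS`, and `stage_occupancy` are hypothetical and chosen only for illustration.

```python
STAGES = ["A", "B", "C"]
CLOCK_PS = 120  # assumption: one clock cycle, matching the 240-360 interval


def stage_occupancy(cycle, num_instructions):
    """Map each stage to the instruction it holds during `cycle` (0-based)."""
    occupancy = {}
    for stage_index, stage in enumerate(STAGES):
        # Instruction k (0-based) enters stage A in cycle k and advances
        # one stage per cycle, so stage s holds instruction cycle - s.
        instr = cycle - stage_index
        in_range = 0 <= instr < num_instructions
        occupancy[stage] = f"I{instr + 1}" if in_range else None
    return occupancy


for cycle in range(5):
    start, end = cycle * CLOCK_PS, (cycle + 1) * CLOCK_PS
    print(f"{start:3d}-{end:3d} ps: {stage_occupancy(cycle, 3)}")
```

In the cycle spanning 240 to 360 ps (cycle index 2), the sketch reports I3 in stage A, I2 in stage B, and I1 in stage C, which matches the trace.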
The three-stage pipeline traced above is an ideal pipelined system. Unfortunately, other factors often arise that diminish the effectiveness of pipelining.
- nonuniform partitioning: the following picture shows a system in which we divide the computation into three stages as before, but the delays through the stages range from 50 ps to 150 ps, so we must set the clock cycle to max(50, 150, 100) + 20 = 170 ps.
- diminishing returns of deep pipelining: in the following picture, we have divided the computation into six stages, each requiring 50 ps. Although we have doubled the number of pipeline stages, throughput improves only by a factor of 14.29/8.33 = 1.71, because the delay through the pipeline registers becomes a limiting factor.
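The arithmetic behind both limitations can be checked with a minimal Python sketch. It assumes a 20 ps pipeline register delay (the "+ 20" term above), one instruction completed per clock cycle, and 100 ps per stage in the uniform three-stage design (inferred from the 8.33 figure); the unpipelined entry simply merges those three stages into one 300 ps block. The function and dictionary names are hypothetical.

```python
REGISTER_DELAY_PS = 20  # assumption: taken from the "+ 20" term in the text


def pipeline_timing(stage_delays_ps, register_delay_ps=REGISTER_DELAY_PS):
    """Return (clock_period_ps, throughput_gips, latency_ps) for one design."""
    # The slowest stage plus the register delay sets the clock period.
    clock_period = max(stage_delays_ps) + register_delay_ps
    # One instruction completes per cycle; 1000 ps per ns gives GIPS.
    throughput_gips = 1000.0 / clock_period
    # Each instruction spends one full cycle in every stage.
    latency = clock_period * len(stage_delays_ps)
    return clock_period, throughput_gips, latency


designs = {
    "unpipelined": [300],              # assumed: three 100 ps stages merged
    "three uniform": [100, 100, 100],  # assumed stage delays
    "three nonuniform": [50, 150, 100],
    "six uniform": [50] * 6,
}

for name, delays in designs.items():
    period, gips, latency = pipeline_timing(delays)
    print(f"{name:17s} period={period:3d} ps  "
          f"throughput={gips:5.2f} GIPS  latency={latency:3d} ps")

speedup = (pipeline_timing(designs["six uniform"])[1]
           / pipeline_timing(designs["three uniform"])[1])
print(f"six-stage vs three-stage speedup: {speedup:.2f}")
```

Under these assumptions the nonuniform design runs at a 170 ps clock (about 5.88 GIPS) even though its total logic delay is unchanged, and the six-stage design reaches about 14.29 GIPS, a speedup of 1.71 rather than 2, because the fixed 20 ps register delay is a larger share of each 70 ps cycle.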
Practices:
Up to this point, we have considered only systems in which the objects passing through the pipeline are completely independent of one another.
For a system that executes machine programs, however, there are potential dependencies between successive instructions.
For example:
irmovq $50, %rax
addq %rax, %rbx ; reads %rax, which is written by line 1
mrmovq 100(%rbx), %rdx ; reads %rbx, which is written by line 2
There is a data dependency between each successive pair of instructions.
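To make the dependencies explicit, here is a minimal Python sketch that finds read-after-write pairs in the three instructions above. The register read and write sets are entered by hand (a hypothetical encoding, not the output of any assembler), and `raw_dependencies` is an illustrative name.

```python
# (text, registers read, registers written): hand-written, hypothetical encoding
program = [
    ("irmovq $50, %rax", set(), {"%rax"}),
    ("addq %rax, %rbx", {"%rax", "%rbx"}, {"%rbx"}),
    ("mrmovq 100(%rbx), %rdx", {"%rbx"}, {"%rdx"}),
]


def raw_dependencies(instructions):
    """Return (producer_index, consumer_index, register) for each
    read-after-write dependency, using 0-based instruction indices."""
    deps = []
    last_writer = {}  # register -> index of the most recent writer
    for index, (_, reads, writes) in enumerate(instructions):
        for reg in sorted(reads):
            if reg in last_writer:
                deps.append((last_writer[reg], index, reg))
        for reg in writes:
            last_writer[reg] = index
    return deps


for producer, consumer, reg in raw_dependencies(program):
    print(f"line {consumer + 1} reads {reg} written by line {producer + 1}")
```

The sketch reports that line 2 depends on line 1 through %rax and line 3 depends on line 2 through %rbx, one dependency per successive pair.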
Another kind of dependency is a control dependency, where the outcome of one instruction determines which instruction executes next:
loop:
subq %rdx, %rbx
jne targ ; branches on the condition codes (CC) set by the subq above
irmovq $10, %rdx
jmp loop
targ:
halt
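The control dependency in this loop can be illustrated with a small Python sketch: the instruction that follows `jne` is unknown until `subq` has produced its result. The function name is hypothetical, and the sketch models only the zero flag, ignoring 64-bit wraparound.

```python
def instruction_after_jne(rbx, rdx):
    """Return the instruction that executes after `jne targ`,
    given the register values just before `subq %rdx, %rbx`."""
    result = rbx - rdx           # subq %rdx, %rbx computes %rbx - %rdx
    zero_flag = (result == 0)    # subq sets ZF when the result is zero
    # jne jumps when ZF is clear, that is, when the result is nonzero.
    if not zero_flag:
        return "halt (at targ)"
    return "irmovq $10, %rdx"


print(instruction_after_jne(7, 5))  # nonzero result: branch taken
print(instruction_after_jne(5, 5))  # zero result: fall through
```

A pipelined processor would like to fetch the next instruction before `subq` has finished, but as the sketch shows, the correct choice between `targ` and the fall-through path depends on a result that is not yet available.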