CS 572 Computer
Architecture
Laboratory
Assignment 3:
Pipelined Datapath with Interlocks and Forwarding
Due
1 December 2017 (Friday) by 11:59 pm
Requirements
-
Each student should independently
accomplish this laboratory assignment. You are allowed to discuss with your friends to solve the coding problems.
-
You must submit
your partially assembled code,
the source code of your simulator (see Part 1 and Part 2), a README file,
and the values of total clock cycles, total instructions executed, and
number of NOPs (see Part 4).
Grading
Criteria
- Partially assembled code for the loop example lab3c.s:
20%
- A simulator for the pipelined machine pipeSim: 50%
- Adhering to coding style: 10%
- README file: 10%
- Values of total clock cycles, total instructions executed, and number
of NOPs at the end of the run (see Part 4): 10%.
0. Introduction
This assignment is a serious step forward from your simulator for multi-cycle
machines. Your will have to develop a simulator for a
scalar pipelined architecture. It is believed that you are
familiar with the five stage MIPS pipeline described in our previous lectures.
However,
you are allowed to study a different architecture by creating something
of your own. Please keep the following requirements in mind:
- your pipeline must be a load-store, GPR-based 3-operand machine;
- your pipeline must be able to execute the subset of MIPS instructions
specified below;
- your simulator must implement interlocks between pipe stages to recognize and
resolve data hazards:
- through forwarding when feasible, including forwarding through
the register file
- through NOPs
otherwise
In this lab assignment, you are NOT required to implement branch or load delay slots. You must perform each part as
described, answer
all related questions, submit your programs.
Part 1.
Building a Pipeline
We assume a simulator you will be building has a five-stage pipeline. The
simulator is more complex than the previous ones you've developed because each
instruction is only partially completed in each cycle, and because you have
multiple things going on concurrently in the same clock cycle.
It should be note that you must simulate concurrent behavior somehow. There are
two models you could use: perhaps the most obvious one is a "push"
model, where you fetch the next instruction and use that to push the behavior of
all the downstream pipeline stages. Alternatively, you may use a
"pull" model, where you complete the instruction in the WB stage,
emptying that stage and pulling the upstream activities down the pipe. This also
has the advantage of getting register file writes done early in the cycle, which
is the key to forwarding through the register file.
Regardless of the model you will be using, the stages in your simple MIPS
pipeline must behave as follows:
- The fetch stage is simple: just use the current value in the PC to index
memory, and retrieve the instruction you find in the memory. Put it in a
buffer (you could call it the IF/ID latch !) where it will be grabbed by
decode in the next clock cycle. Note that you have to make sure that the
decode stage doesn't get the instruction for cycle (N+1) by mistake.
- Decode reads zero, one or two values out of the register file and stores
them in a buffer (the ID/EX latch ?) for use by the execute stage in the
next clock cycle (we're not working directly with binary encoding of the
instructions, so this stage can be a little smarter than the real hardware).
If the current instruction is a branch, this stage needs to decide whether
or not the branch is taken, and update the PC accordingly. Please keep in
mind that this updated PC shouldn't be used until the next cycle, so if
you're using the "pull" model you'll need to delay the change of
PC until then.
- Execute behaves as necessary, either completing an R-type instruction
(such as an add) or doing an address calculation for a load or a store. The
result could be put in an EX/MEM latch.
- The memory stage references data memory, performing a read or a write if
the instruction in the stage is a load or a store, respectively. If the
instruction in MEM is an R-type instruction, the MEM/WB latch should buffer
the result that was being held in the EX/MEM latch.
- Write-back puts the results of an R-type instruction or a load into the
register file. This value should come from the MEM/WB latch.
To run the example code we'll be working with, you need ADDI,
B, BEQZ,
BGE, BNE,
LA, LB,
LI, SUBI,
and SYSCALL. xxx
Part 2.
Forwarding
If you take a look at the "Minimizing Data hazard Stalls by Forwarding"section (in A.2) of
the textbook, you will understand that the appropriate place in your simulator to make forwarding decisions would be in the
"execute" stage logic. For detailed information about forwaring and how to implement forwarding, please read section A.2 and Forwarding. For how to implement a simple MIPS pipeline, please read A.3.
Part 3.
Branches and Loads
Please make sure there is always a NOP
(no-op) following any branch or load instructions. If we "compile"
things this way, the NOP
will be fetched while the branch is in decode. The machine will head off in the
correct direction in the cycle following the NOP.
Your simple five-stage MIPS pipeline resolves branches during decode, so any
further speed-up has to come from doing branch prediction right at the front of
the pipe, during fetch (before the machine even knows it has fetched a branch!).
Correlating and multilevel predictors apply to machines with longer branch
delays. They don't apply to this particular machine. However, there are a couple
of cache-style mechanisms for improving the branch performance of the simple
MIPS pipeline. In
this lab, just focus on implementing an execution model for the pipeline that
can handle data hazards.
A suggestion
You should focus on getting the pipeline running hazard-free code first. Once
the pipeline is behaving correctly, save a working version of your simulator
before moving on to add the forwarding logic. There will be significant part
marks for just part 1.
A request
In an effort to implement your simulator, please reflect on whether there is
a way to structure it so that you could configure it as less-than-complete, in
the sense that you'd be able to run four partial pipelines, as well as the full
up machine:
- IF only
- IF + ID
- IF + ID + EX
- IF + ID + EX + MEM
- IF + ID + EX + MEM + WB (the full machine)
It turns out that in more complex machines, the downstream stages are typically
responsible for most of the performance loss (CPI increasing past 1). The
ability to add stages successively and then measure CPI would pinpoint the
stage(s) responsible for the largest incremental performance loss(es). If you
see a nice way to structure your simulator so that you can run it with
successively more complete pipelines, that would point to a way to gather such
data. Please note: you are NOT required
to implement this. Rather, just let me know if you see a nice way to do so.
Detailed syntax:
ADD Rdest, Rsrc1, Rsrc2
ADDI Rdest, Rsrc1, Imm
B label
BEQZ Rsrc1, label
BGE Rsrc1, Rsrc2, label
BNE Rsrc1, Rsrc2, label
LA Rdest, label
LB Rdest, offset(Rsrc1)
LI Rdest, Imm
NOP
SUBI Rdest, Rsrc1, Imm
SYSCALL
- SYSCALL These are synchronous interrupts that are wired into the code of a program to invoke some operating system service (e.g., file open or display a message on the screen.) All you need to know is that a system call instruction is a jump, it will divert to some procedure and execute the code there until it encounters a return instruction, which will cause the flow of execution to return to the point immediately following the syscall instruction.Basically your simulator does NOT see the instructions of the body of the syscall. What the simulator is doing is as soon as it sees a syscall it performs the functionality of the body of the syscall. Because the body of the syscall is in the OS and because we do not want to simulate the OS, the simulator performs the functionality of the body of the syscall WITHOUT fetching the body of the syscall. The simulator simply fetches the syscall, does the entire functionality and charges 1 cycle for it (which is a fake because in reality you should fetched and executed the body of the syscall) and then fetches the instruction following the syscall (AS IF the syscall instruction does not change control flow and is like an ALU instruction) and so on.
- Rsrc1, Rsrc2,
and Rdest
specify one of the general-purpose registers ($0
through $31).
- Imm denotes a signed immediate (an integer).
- label denotes the address associated with a label in the .text
or .data segment. This is encoded as a signed offset relative to
the current value of the program counter, which will be one word past the
address of the current instruction (i.e. the instruction referencing the label).
- offset denotes a signed offset (an immediate in the
instruction) which is to be added to the value in the base register, Rbase.
The result is the memory address from which a value is loaded into Rdest,
or to which the value in Rsrc1 is stored.
Part 4.
A Test
Case
(1) You need to instrument your simulator to keep track of number of clock
cycles of execution, total instruction count, and number of NOPs
executed.
You should use a very simple test case to see whether your
pipeline is working. You can invent other variations of this code to test things
more thoroughly. As for the previous lab, you will need to hand-assemble it for
execution.
(2) You should be able to take nearly the same example program
you used for Lab 2 - but with NOPs inserted after each load or
branch - and run it on your pipeline. This requires minimal changes to the
assembly (some offsets will change due to the insertion of the NOPs),
and still give you a fairly complex test case.
(3) Please try the vector addition example from the course
pack, just to wring out the forwarding logic thoroughly.
One simulation-related detail: make sure
your simulator runs long enough to push the last instruction in the program all
the way to the end of the pipe! The simplest solution, of course, is just to add
a bunch of NOPs to the end of the code (and that is acceptable),
although there are certainly simple ways to do this more elegantly.
Your simulator should output total clock cycles, total instructions executed,
and number of NOPs at the end of the run.
Please report the values of the total clock cycles, total instructions
executed, and number of NOPs. (see Grading
Criteria 5)
Lab Assignment
3 Submission
- All work in this lab must be done by you alone.
- You must submit the following two files
- lab3c.s - It contains your partially assembled code
for the loop example.
- pipeSim - It contains your simulator for the pipelined
machine.
- If you have used something different than the five stage MIPS
pipeline, you will need to provide sufficient hardcopy documentation so
that the Grader can understand the details of the machine.
- If you have a chance to think about a way to configure the partial
pipeline, please send the Grader an e-mail.
E-mail the four files to chinmayk93@gmail.com and txie@mail.sdsu.edu
with the subject line of "CS 572 Lab Assignment 3". In
your Email, please indicate your name
and your student ID.
Note that this assignment counts 10% towards your course grade.