Lab Assignment 3: Pipelined Datapath with Interlocks and Forwarding

CS 572 Computer Architecture

Laboratory Assignment 3: Pipelined Datapath with Interlocks and Forwarding

Due 1 December 2017 (Friday) by 11:59 pm

Requirements

Each student should independently accomplish this laboratory assignment. You are allowed to discuss with your friends to solve the coding problems.
You must submit your partially assembled code, the source code of your simulator (see Part 1 and Part 2), a README file, and the values of total clock cycles, total instructions executed, and number of NOPs (see Part 4).

Grading Criteria

Partially assembled code for the loop example lab3c.s: 20%
A simulator for the pipelined machine pipeSim: 50%
Adhering to coding style: 10%
README file: 10%
Values of total clock cycles, total instructions executed, and number of NOPs at the end of the run (see Part 4): 10%.

0. Introduction

This assignment is a serious step forward from your simulator for multi-cycle machines. Your will have to develop a simulator for a scalar pipelined architecture. It is believed that you are familiar with the five stage MIPS pipeline described in our previous lectures. However, you are allowed to study a different architecture by creating something of your own. Please keep the following requirements in mind:

your pipeline must be a load-store, GPR-based 3-operand machine;
your pipeline must be able to execute the subset of MIPS instructions specified below;
your simulator must implement interlocks between pipe stages to recognize and resolve data hazards:

through forwarding when feasible, including forwarding through the register file
through NOPs otherwise

In this lab assignment, you are NOT required to implement branch or load delay slots. You must perform each part as described, answer all related questions, submit your programs.

Part 1. Building a Pipeline

We assume a simulator you will be building has a five-stage pipeline. The simulator is more complex than the previous ones you've developed because each instruction is only partially completed in each cycle, and because you have multiple things going on concurrently in the same clock cycle.

It should be note that you must simulate concurrent behavior somehow. There are two models you could use: perhaps the most obvious one is a "push" model, where you fetch the next instruction and use that to push the behavior of all the downstream pipeline stages. Alternatively, you may use a "pull" model, where you complete the instruction in the WB stage, emptying that stage and pulling the upstream activities down the pipe. This also has the advantage of getting register file writes done early in the cycle, which is the key to forwarding through the register file.

Regardless of the model you will be using, the stages in your simple MIPS pipeline must behave as follows:

The fetch stage is simple: just use the current value in the PC to index memory, and retrieve the instruction you find in the memory. Put it in a buffer (you could call it the IF/ID latch !) where it will be grabbed by decode in the next clock cycle. Note that you have to make sure that the decode stage doesn't get the instruction for cycle (N+1) by mistake.
Decode reads zero, one or two values out of the register file and stores them in a buffer (the ID/EX latch ?) for use by the execute stage in the next clock cycle (we're not working directly with binary encoding of the instructions, so this stage can be a little smarter than the real hardware). If the current instruction is a branch, this stage needs to decide whether or not the branch is taken, and update the PC accordingly. Please keep in mind that this updated PC shouldn't be used until the next cycle, so if you're using the "pull" model you'll need to delay the change of PC until then.
Execute behaves as necessary, either completing an R-type instruction (such as an add) or doing an address calculation for a load or a store. The result could be put in an EX/MEM latch.
The memory stage references data memory, performing a read or a write if the instruction in the stage is a load or a store, respectively. If the instruction in MEM is an R-type instruction, the MEM/WB latch should buffer the result that was being held in the EX/MEM latch.
Write-back puts the results of an R-type instruction or a load into the register file. This value should come from the MEM/WB latch.

To run the example code we'll be working with, you need ADDI, B, BEQZ, BGE, BNE, LA, LB, LI, SUBI, and SYSCALL. xxx

Part 2. Forwarding

If you take a look at the "Minimizing Data hazard Stalls by Forwarding"section (in A.2) of the textbook, you will understand that the appropriate place in your simulator to make forwarding decisions would be in the "execute" stage logic. For detailed information about forwaring and how to implement forwarding, please read section A.2 and Forwarding. For how to implement a simple MIPS pipeline, please read A.3.

Part 3. Branches and Loads

Please make sure there is always a NOP (no-op) following any branch or load instructions. If we "compile" things this way, the NOP will be fetched while the branch is in decode. The machine will head off in the correct direction in the cycle following the NOP.

Your simple five-stage MIPS pipeline resolves branches during decode, so any further speed-up has to come from doing branch prediction right at the front of the pipe, during fetch (before the machine even knows it has fetched a branch!). Correlating and multilevel predictors apply to machines with longer branch delays. They don't apply to this particular machine. However, there are a couple of cache-style mechanisms for improving the branch performance of the simple MIPS pipeline. In this lab, just focus on implementing an execution model for the pipeline that can handle data hazards.

A suggestion

You should focus on getting the pipeline running hazard-free code first. Once the pipeline is behaving correctly, save a working version of your simulator before moving on to add the forwarding logic. There will be significant part marks for just part 1.

A request

In an effort to implement your simulator, please reflect on whether there is a way to structure it so that you could configure it as less-than-complete, in the sense that you'd be able to run four partial pipelines, as well as the full up machine:

IF only
IF + ID
IF + ID + EX
IF + ID + EX + MEM
IF + ID + EX + MEM + WB (the full machine)

It turns out that in more complex machines, the downstream stages are typically responsible for most of the performance loss (CPI increasing past 1). The ability to add stages successively and then measure CPI would pinpoint the stage(s) responsible for the largest incremental performance loss(es). If you see a nice way to structure your simulator so that you can run it with successively more complete pipelines, that would point to a way to gather such data. Please note: you are NOT required to implement this. Rather, just let me know if you see a nice way to do so.

Detailed syntax:

ADD Rdest, Rsrc1, Rsrc2 ADDI Rdest, Rsrc1, Imm
B label
BEQZ Rsrc1, label
BGE Rsrc1, Rsrc2, label
BNE Rsrc1, Rsrc2, label
LA Rdest, label
LB Rdest, offset(Rsrc1)
LI Rdest, Imm
NOP SUBI Rdest, Rsrc1, Imm
SYSCALL

SYSCALL These are synchronous interrupts that are wired into the code of a program to invoke some operating system service (e.g., file open or display a message on the screen.) All you need to know is that a system call instruction is a jump, it will divert to some procedure and execute the code there until it encounters a return instruction, which will cause the flow of execution to return to the point immediately following the syscall instruction.Basically your simulator does NOT see the instructions of the body of the syscall. What the simulator is doing is as soon as it sees a syscall it performs the functionality of the body of the syscall. Because the body of the syscall is in the OS and because we do not want to simulate the OS, the simulator performs the functionality of the body of the syscall WITHOUT fetching the body of the syscall. The simulator simply fetches the syscall, does the entire functionality and charges 1 cycle for it (which is a fake because in reality you should fetched and executed the body of the syscall) and then fetches the instruction following the syscall (AS IF the syscall instruction does not change control flow and is like an ALU instruction) and so on.
Rsrc1, Rsrc2, and Rdest specify one of the general-purpose registers ($0 through $31).
Imm denotes a signed immediate (an integer).
label denotes the address associated with a label in the .text or .data segment. This is encoded as a signed offset relative to the current value of the program counter, which will be one word past the address of the current instruction (i.e. the instruction referencing the label).
offset denotes a signed offset (an immediate in the instruction) which is to be added to the value in the base register, Rbase. The result is the memory address from which a value is loaded into Rdest, or to which the value in Rsrc1 is stored.

Part 4. A Test Case

(1) You need to instrument your simulator to keep track of number of clock cycles of execution, total instruction count, and number of NOPs executed.

You should use a very simple test case to see whether your pipeline is working. You can invent other variations of this code to test things more thoroughly. As for the previous lab, you will need to hand-assemble it for execution.

(2) You should be able to take nearly the same example program you used for Lab 2 - but with NOPs inserted after each load or branch - and run it on your pipeline. This requires minimal changes to the assembly (some offsets will change due to the insertion of the NOPs), and still give you a fairly complex test case.

(3) Please try the vector addition example from the course pack, just to wring out the forwarding logic thoroughly.

One simulation-related detail: make sure your simulator runs long enough to push the last instruction in the program all the way to the end of the pipe! The simplest solution, of course, is just to add a bunch of NOPs to the end of the code (and that is acceptable), although there are certainly simple ways to do this more elegantly.

Your simulator should output total clock cycles, total instructions executed, and number of NOPs at the end of the run.

Please report the values of the total clock cycles, total instructions executed, and number of NOPs. (see Grading Criteria 5)

Lab Assignment 3 Submission

All work in this lab must be done by you alone.
You must submit the following two files
1. lab3c.s - It contains your partially assembled code for the loop example.
2. pipeSim - It contains your simulator for the pipelined machine.
3. If you have used something different than the five stage MIPS pipeline, you will need to provide sufficient hardcopy documentation so that the Grader can understand the details of the machine.
4. If you have a chance to think about a way to configure the partial pipeline, please send the Grader an e-mail.

E-mail the four files to chinmayk93@gmail.com and txie@mail.sdsu.edu with the subject line of "CS 572 Lab Assignment 3". In your Email, please indicate your name and your student ID. Note that this assignment counts 10% towards your course grade.