CS 572 Computer Architecture

Laboratory Assignment 3: Pipelined Datapath with Interlocks and Forwarding

Due 1 December 2017 (Friday) by 11:59 pm

Requirements

  1. Each student should independently accomplish this laboratory assignment. You are allowed to discuss with your friends to solve the coding problems.

  2. You must submit your partially assembled code, the source code of your simulator (see Part 1 and Part 2), a README file, and the values of total clock cycles, total instructions executed, and number of NOPs  (see Part 4).

Grading Criteria

  1. Partially assembled code for the loop example lab3c.s: 20%
  2. A simulator for the pipelined machine pipeSim: 50%
  3. Adhering to coding style: 10%
  4. README file: 10%
  5. Values of total clock cycles, total instructions executed, and number of NOPs at the end of the run (see Part 4): 10%.

 0. Introduction                                                                                                                              

This assignment is a serious step forward from your simulator for multi-cycle machines. Your will have to develop a simulator for a scalar pipelined architecture. It is believed that you are familiar with the five stage MIPS pipeline described in our previous lectures. However, you are allowed to study a different architecture by creating something of your own. Please keep the following requirements in mind:
In this lab assignment, you are NOT required to implement branch or load delay slots. You must perform each part as described, answer all related questions, submit your programs.

 Part 1. Building a Pipeline                                                                                                            

We assume a simulator you will be building has a five-stage pipeline. The simulator is more complex than the previous ones you've developed because each instruction is only partially completed in each cycle, and because you have multiple things going on concurrently in the same clock cycle.

It should be note that you must simulate concurrent behavior somehow. There are two models you could use: perhaps the most obvious one is a "push" model, where you fetch the next instruction and use that to push the behavior of all the downstream pipeline stages. Alternatively, you may use a "pull" model, where you complete the instruction in the WB stage, emptying that stage and pulling the upstream activities down the pipe. This also has the advantage of getting register file writes done early in the cycle, which is the key to forwarding through the register file.

Regardless of the model you will be using, the stages in your simple MIPS pipeline must behave as follows:
  1. The fetch stage is simple: just use the current value in the PC to index memory, and retrieve the instruction you find in the memory. Put it in a buffer (you could call it the IF/ID latch !) where it will be grabbed by decode in the next clock cycle. Note that you have to make sure that the decode stage doesn't get the instruction for cycle (N+1) by mistake.
  2. Decode reads zero, one or two values out of the register file and stores them in a buffer (the ID/EX latch ?) for use by the execute stage in the next clock cycle (we're not working directly with binary encoding of the instructions, so this stage can be a little smarter than the real hardware). If the current instruction is a branch, this stage needs to decide whether or not the branch is taken, and update the PC accordingly. Please keep in mind that this updated PC shouldn't be used until the next cycle, so if you're using the "pull" model you'll need to delay the change of PC until then.
  3. Execute behaves as necessary, either completing an R-type instruction (such as an add) or doing an address calculation for a load or a store. The result could be put in an EX/MEM latch.
  4. The memory stage references data memory, performing a read or a write if the instruction in the stage is a load or a store, respectively. If the instruction in MEM is an R-type instruction, the MEM/WB latch should buffer the result that was being held in the EX/MEM latch.
  5. Write-back puts the results of an R-type instruction or a load into the register file. This value should come from the MEM/WB latch.
To run the example code we'll be working with, you need ADDI, B, BEQZ, BGE, BNE, LA, LB, LI, SUBI, and SYSCALL. xxx

 Part 2. Forwarding                                                                                                                         

If you take a look at the "Minimizing Data hazard Stalls by Forwarding"section (in A.2) of the textbook, you will understand that the appropriate place in your simulator to make forwarding decisions would be in the "execute" stage logic. For detailed information about forwaring and how to implement forwarding, please read section A.2 and Forwarding. For how to implement a simple MIPS pipeline, please read A.3.

 Part 3. Branches and Loads                                                                                                        

Please make sure there is always a NOP (no-op) following any branch or load instructions. If we "compile" things this way, the NOP will be fetched while the branch is in decode. The machine will head off in the correct direction in the cycle following the NOP.

Your simple five-stage MIPS pipeline resolves branches during decode, so any further speed-up has to come from doing branch prediction right at the front of the pipe, during fetch (before the machine even knows it has fetched a branch!). Correlating and multilevel predictors apply to machines with longer branch delays. They don't apply to this particular machine. However, there are a couple of cache-style mechanisms for improving the branch performance of the simple MIPS pipeline. In this lab, just focus on implementing an execution model for the pipeline that can handle data hazards.

A suggestion

You should focus on getting the pipeline running hazard-free code first. Once the pipeline is behaving correctly, save a working version of your simulator before moving on to add the forwarding logic. There will be significant part marks for just part 1.

A request

In an effort to implement your simulator, please reflect on whether there is a way to structure it so that you could configure it as less-than-complete, in the sense that you'd be able to run four partial pipelines, as well as the full up machine:

It turns out that in more complex machines, the downstream stages are typically responsible for most of the performance loss (CPI increasing past 1). The ability to add stages successively and then measure CPI would pinpoint the stage(s) responsible for the largest incremental performance loss(es). If you see a nice way to structure your simulator so that you can run it with successively more complete pipelines, that would point to a way to gather such data. Please note: you are NOT required to implement this. Rather, just let me know if you see a nice way to do so.

Detailed syntax:
ADD Rdest, Rsrc1, Rsrc2
ADDI  Rdest, Rsrc1, Imm

B  label
BEQZ  Rsrc1, label
BGE  Rsrc1, Rsrc2, label
BNE  Rsrc1, Rsrc2, label
LA  Rdest, label
LB  Rdest, offset(Rsrc1)
LI  Rdest, Imm
NOP
SUBI  Rdest, Rsrc1, Imm

SYSCALL

 Part 4. A Test Case                                                                                                                        

(1) You need to instrument your simulator to keep track of number of clock cycles of execution, total instruction count, and number of NOPs executed.

You should use a very simple test case to see whether your pipeline is working. You can invent other variations of this code to test things more thoroughly. As for the previous lab, you will need to hand-assemble it for execution.

(2) You should be able to take nearly the same example program you used for Lab 2 - but with NOPs inserted after each load or branch - and run it on your pipeline. This requires minimal changes to the assembly (some offsets will change due to the insertion of the NOPs), and still give you a fairly complex test case.

(3) Please try the vector addition example from the course pack, just to wring out the forwarding logic thoroughly.

One simulation-related detail: make sure your simulator runs long enough to push the last instruction in the program all the way to the end of the pipe! The simplest solution, of course, is just to add a bunch of NOPs to the end of the code (and that is acceptable), although there are certainly simple ways to do this more elegantly.

Your simulator should output total clock cycles, total instructions executed, and number of NOPs at the end of the run.

Please report the values of the total clock cycles, total instructions executed, and number of NOPs. (see Grading Criteria 5)

 Lab Assignment 3 Submission                                                                                                      

E-mail the four files to chinmayk93@gmail.com and txie@mail.sdsu.edu with the subject line of "CS 572 Lab Assignment 3". In your Email, please indicate your name and your student ID. Note that this assignment counts 10% towards your course grade.