EE 360N - Study Questions

Department of Electrical and Computer Engineering

The University of Texas at Austin

EE 360N, Fall 2004
Study Questions (covering some of the topics covered in class after Problem Set 5)
Due date: Not to be turned in
Yale N. Patt, Instructor
Aater Suleman, Huzefa Sanjeliwala, Dam Sunwoo, TAs

These questions are to aid you in your studies. They are not to be turned in and they do not cover all the topics covered in class after Problem Set 5.

Suppose we have the following loop executing on a pipelined LC-3b machine.
```
         
DOIT     STW   R1, R6, #0
         ADD   R6, R6, #1
         AND   R3, R1, R2
         BRz   EVEN
         ADD   R1, R1, #3
         ADD   R5, R5, #-1
         BRp   DOIT
EVEN     ADD   R1, R1, #1
         ADD   R7, R7, #-1 
         BRp   DOIT
```
Assume that before the loop starts, the registers are initialized to the following integer values:
R1 <- 0
R2 <- 1
R5 <- 5
R6 <- 4000
R7 <- 5
"Fetch" takes 1 cycle, "Decode" takes 1 cycle, the "Execute" stage takes a variable number of cycles depending on the type of instruction (see below), and the "Store Result" stage takes 1 cycle.

All execution units (including the load/store unit) are fully pipelined and the following instructions that use these units take the indicated number of cycles:

STW: 3 cycles
ADD: 3 cycles
AND: 2 cycles
BR : 1 cycle
For example, the execution of an ADD instruction followed by a BR would look like:
```
ADD       F  | D  | E1 | E2 | E3 | ST
BR             F  | D  | -  | -  | E1 | ST
TARGET                                  F  | D
```
This figure shows several things about the structure of the pipeline:
- Data forwarding is used whenever possible. An instruction that depends on a previous instruction can use that instruction's result in the cycle right after the previous instruction finishes its "Execute" stage.
- Branch instructions require 1 "Execute" cycle to resolve the branch. Hence, the target instruction can be fetched while the BR instruction is in the ST stage.
Also, you are given the following information about the pipeline and the ISA:
- The pipeline implements "in-order execution". A scoreboarding scheme is used as discussed in class.
- The pipeline emulates the LC-3b ISA. Hence, the above instructions are all LC-3b instructions.
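Before reasoning about pipeline timing, it can help to know which instructions actually execute and in what order. The following is a minimal Python sketch that traces only the architectural behavior of the loop (no timing). It assumes the initial register values listed above, treats the registers as unbounded Python integers, ignores any STW alignment concerns, and gives R3 an arbitrary starting value of 0 (it is overwritten before it is read). The names `run_loop`, `stores`, and `passes` are illustrative only.

```python
def run_loop():
    """Trace the DOIT loop at the ISA level; pipeline timing is not modeled."""
    # Initial values from the problem statement (R3 = 0 is an assumption).
    regs = {"R1": 0, "R2": 1, "R3": 0, "R5": 5, "R6": 4000, "R7": 5}
    stores = []   # (address, value) pairs written by STW
    passes = 0    # number of times execution reaches DOIT

    while True:
        passes += 1
        stores.append((regs["R6"], regs["R1"]))   # STW R1, R6, #0
        regs["R6"] += 1                           # ADD R6, R6, #1
        regs["R3"] = regs["R1"] & regs["R2"]      # AND R3, R1, R2
        if regs["R3"] != 0:                       # BRz EVEN is not taken
            regs["R1"] += 3                       # ADD R1, R1, #3
            regs["R5"] -= 1                       # ADD R5, R5, #-1
            if regs["R5"] > 0:                    # BRp DOIT
                continue
        # EVEN: reached via BRz, or by falling through the first BRp
        regs["R1"] += 1                           # ADD R1, R1, #1
        regs["R7"] -= 1                           # ADD R7, R7, #-1
        if regs["R7"] > 0:                        # BRp DOIT
            continue
        break                                     # loop exits
    return regs, stores, passes


regs, stores, passes = run_loop()
print(passes, regs)
```

Under these assumptions the two paths alternate, so the branch outcomes follow a regular pattern that can then be laid onto the pipeline diagram by hand.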
Answer the following questions:
From Tanenbaum, 4th edition, Appendix B, 4.
The following binary floating-point number consists of a sign bit, an excess 63, radix 2 exponent, and a 16-bit fraction. Express the value of this number as a decimal number.
0 0111111 0000001111111111
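One way to check a hand calculation for this format is a short Python sketch. It assumes the convention that the fraction is read as 0.f (binary point to the left of the fraction, no hidden bit); if your text uses a different convention, adjust accordingly. The function name `decode` and its parameters are hypothetical.

```python
from fractions import Fraction


def decode(bits, exp_bits=7, frac_bits=16, bias=63):
    """Decode sign | excess-`bias` exponent | fraction, assuming value = 0.f * 2**e."""
    bits = bits.replace(" ", "")
    sign = -1 if bits[0] == "1" else 1
    exponent = int(bits[1:1 + exp_bits], 2) - bias
    # Assumption: the fraction has no hidden bit and lies in [0, 1).
    fraction = Fraction(int(bits[1 + exp_bits:], 2), 2 ** frac_bits)
    return sign * fraction * Fraction(2) ** exponent


value = decode("0 0111111 0000001111111111")
print(value, float(value))   # prints: 1023/65536 0.0156097412109375
```

Using `Fraction` keeps the result exact, so the decimal expansion can be compared digit for digit with a manual answer.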

From Tanenbaum, 4th edition, Appendix B, 5.
To add two floating point numbers, you must adjust the exponents (by shifting the fraction) to make them the same. Then you can add the fractions and normalize the result, if need be. Add the single precision IEEE floating-point numbers 3EE00000H and 3D800000H and express the normalized result in hexadecimal. ['H' is a notation indicating these numbers are in hexadecimal]
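A hand-computed sum can be cross-checked with Python's standard `struct` module. This sketch does not perform the align, add, and normalize steps explicitly; it relies on the host's IEEE single-precision conversion, so it only confirms the final bit pattern. The helper names `hex_to_float` and `float_to_hex` are illustrative.

```python
import struct


def hex_to_float(h):
    """Interpret an 8-digit hex string as a big-endian IEEE single."""
    return struct.unpack(">f", bytes.fromhex(h))[0]


def float_to_hex(x):
    """Pack a Python float as an IEEE single and return its hex digits."""
    return struct.pack(">f", x).hex().upper()


a = hex_to_float("3EE00000")   # 1.11b * 2**-2 = 0.4375
b = hex_to_float("3D800000")   # 1.0b  * 2**-4 = 0.0625
total_hex = float_to_hex(a + b)
print(a, b, total_hex)         # prints: 0.4375 0.0625 3F000000
```

Both operands and their sum are exactly representable, so no rounding is involved in this particular case.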

From Tanenbaum, 4th edition, Appendix B, 6.
The Tightwad Computer Company has decided to come out with a machine having 16-bit floating-point numbers. The model 0.001 has a floating-point format with a sign bit, 7-bit, excess 63 exponent and 8-bit fraction. Model 0.002 has a sign bit, 5-bit, excess 15 exponent and a 10-bit fraction. Both use radix 2 exponentiation. What are the smallest and largest positive normalized numbers on both models? About how many decimal digits of precision does each have? Would you buy either one?
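The range and precision of each model can be estimated with a small Python sketch. It rests on several assumptions that the question does not state: the fraction is read as 0.f with no hidden bit, "normalized" means the leading fraction bit is 1, and every exponent bit pattern is usable (none reserved). The function name `normalized_range` is hypothetical.

```python
import math


def normalized_range(exp_bits, frac_bits, bias):
    """Return (smallest, largest, decimal digits) under the stated assumptions."""
    # Smallest normalized fraction is 0.100...0b = 1/2, with the minimum exponent.
    smallest = 0.5 * 2.0 ** (0 - bias)
    # Largest fraction is 0.111...1b = 1 - 2**-frac_bits, with the maximum exponent.
    largest = (1 - 2.0 ** -frac_bits) * 2.0 ** ((2 ** exp_bits - 1) - bias)
    # Each fraction bit contributes log10(2) decimal digits.
    digits = frac_bits * math.log10(2)
    return smallest, largest, digits


for name, args in [("Model 0.001", (7, 8, 63)), ("Model 0.002", (5, 10, 15))]:
    smallest, largest, digits = normalized_range(*args)
    print(f"{name}: {smallest:.3e} .. {largest:.3e}, about {digits:.1f} digits")
```

The trade-off is visible directly: the first model spends bits on range, the second on precision.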

In an Omega network as presented in class, assume that there are n inputs and n outputs. Let k be the size of each switch. For k taking the values 2, 4, 8, and 64, answer the following questions. (Assume the cost of each switch is k^2)
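As a sketch for exploring how k affects the network, the Python below assumes the usual Omega organization: log_k(n) stages, each with n/k switches of size k x k, and the stated cost of k^2 per switch. The value n = 64 is chosen here only because it is a power of every listed k; it is not given in the question, and `omega_network` is an illustrative name.

```python
def omega_network(n, k):
    """Return (stages, total switches, total cost) for an n-input Omega network."""
    stages, reach = 0, 1
    while reach < n:          # integer computation of log_k(n)
        reach *= k
        stages += 1
    if reach != n:
        raise ValueError("n must be a power of k")
    switches = stages * (n // k)    # n/k switches in each stage
    return stages, switches, switches * k * k


n = 64  # assumed size, not part of the original question
for k in (2, 4, 8, 64):
    print(k, omega_network(n, k))
```

With k = n the network degenerates to a single crossbar, which makes the cost versus latency trade-off easy to see.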
We have the following expression to compute:
```
    a*x^6 + b*x^5 + c*x^4 + d*x^3 + e*x^2 + f*x + g 
```
- How many operations and time-steps will the computation take on a single-processor system? (Use the smallest number of operations possible.)
- How many operations and time-steps will the computation take on a multiprocessor system with 4 processors? (Use the smallest number of operations possible.)
- What is the speedup of the multiprocessor system over a single processor?
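For the single-processor case, Horner's rule is one well-known way to keep the operation count low: the polynomial is rewritten as (((((a*x + b)*x + c)*x + d)*x + e)*x + f)*x + g. The Python sketch below evaluates it that way and counts operations. Whether this is the intended answer is an assumption, and the 4-processor schedule is not modeled here. The name `horner` and the sample coefficients are illustrative.

```python
def horner(coeffs, x):
    """Evaluate a polynomial (highest-degree coefficient first) and count operations."""
    result = coeffs[0]
    ops = 0
    for c in coeffs[1:]:
        result = result * x + c   # one multiply and one add per coefficient
        ops += 2
    return result, ops


# Sample values a..g = 1..7 and x = 2, chosen only for illustration.
value, ops = horner([1, 2, 3, 4, 5, 6, 7], 2)
print(value, ops)   # prints: 247 12
```

Each step depends on the previous one, so on one processor the number of time-steps equals the number of operations. A parallel schedule has to break that dependence chain, typically at the price of extra multiplications to form powers of x.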
The state diagram for the Goodman cache consistency scheme makes one assumption about the size of the cache blocks. What is it? (Hint: Focus on the case in which a block is in the DIRTY state and a BW signal comes in. Where do we go? Why?) If that assumption is not made, what will be the change in the state diagram? Draw the new state diagram.