Booth's algorithm: an improvement over the standard shift-and-add algorithm for multiplication, enabling each iteration to consume two bits of the multiplier rather than one. (A small C sketch follows this group of entries.)
residue: another word for remainder.
modulus: in a modulo scheme of any sort, the number of distinct values. In practical terms, if I say "A modulo p," I am asking for the remainder when I divide the value A by the modulus p. Example: 10 modulo 3 = 1.
pairwise relatively prime: a set of values is pairwise relatively prime if the greatest common divisor of every pair is 1. For example, 7, 8, and 9 are pairwise relatively prime, since gcd(7,8) = 1, gcd(7,9) = 1, and gcd(8,9) = 1. Note that 8 and 9 are not themselves prime.
Chinese remainder theorem: if I have a set of k moduli p1, p2, ..., pk that are pairwise relatively prime, and if I represent each value as the set of k residues modulo the corresponding moduli, then the residue sets of the values 1 to the product of the k moduli are all distinct. (See the second sketch below.)
floating point: a data type supported by the ISAs of most computers. It allows larger and tinier numbers than is possible with an equal number of bits in a fixed point representation. This is accomplished by using some bits to express the size of the value (exponent bits) at the expense of bits of precision. A magnitude is represented by a fractional part and an exponent part: the fractional part contains the significant bits of precision, and the exponent part specifies the size. A value can be expressed in floating point either as a normalized number or, if it is too small to be expressed as a normalized number, as a "subnormal number." Normalized numbers are of the form (-1)^sign * 1.fraction * radix^exp, where the 64-bit format in IEEE arithmetic specifies the bits as follows: [63:63] is the sign bit, [62:52] is the exponent field, and [51:0] is the fraction field. If the value is too small to be represented as a normalized number, it can still be represented in the form (-1)^sign * 0.fraction * radix^smallest_exp, where [63:63] is the sign bit, [62:52] contains the smallest exponent, and [51:0] is the fraction field. (The third sketch below pulls these fields apart.)
signed magnitude: a representation of a value that uses one bit for sign and the remaining bits for magnitude. Used in the floating point data type.
exponent: the bits used to describe the size of the datum. See floating point.
mantissa (aka significand or fraction): the part of the representation that contains the significant bits of the value.
normalize: to put into a standard (i.e., "normal") form. In the case of IEEE floating point arithmetic, that form is 1.<> x 2^<>.
radix: the base, as in 2^<> ---> 2 is the radix.
excess code (bias): a code that uses an excess, or bias, to compute its code words. To obtain a code word, we add the bias to the value to be encoded. For example, with an excess-7 code, the value -1 would be expressed in eight bits (for example) as 00000110.
binade: the interval from 2^n to 2^(n+1).
wobble: the ratio of the rounding errors on the two sides of a power of the radix. For radix 2, one ULP just above 2^n is twice as large as one ULP just below 2^n; therefore the wobble is 2.
precision: the number of bits of significance in the representation of a value. With 52 bits of fraction (plus the implied leading 1), the value has 53 bits of precision.
ULP (unit in the last place): the value 0.000...001 x 2^<>. For example, with 4 bits of fraction and <> = -2, one ULP is 1/64.
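Below is a minimal C sketch of the radix-4 (modified) Booth idea named in the Booth's algorithm entry above: each iteration inspects three multiplier bits, recodes them to a factor in {-2, -1, 0, +1, +2}, and retires two bits per step. The function name and the 32x32-to-64-bit framing are illustrative choices, not anything from the handout.

```c
#include <stdint.h>
#include <stdio.h>

int64_t booth_multiply(int32_t multiplicand, int32_t multiplier)
{
    int64_t m = multiplicand;                    /* sign-extended multiplicand */
    uint64_t y = (uint64_t)(int64_t)multiplier;  /* sign-extended bits to scan */
    int64_t product = 0;
    for (int i = 0; i < 32; i += 2) {
        /* triplet = bits y[i+1], y[i], y[i-1], where y[-1] is defined as 0 */
        int triplet = (i == 0) ? (int)((y << 1) & 0x7)
                               : (int)((y >> (i - 1)) & 0x7);
        int factor;
        switch (triplet) {                       /* recode two bits at a time */
        case 0: case 7: factor =  0; break;      /* 000, 111 */
        case 1: case 2: factor =  1; break;      /* 001, 010 */
        case 3:         factor =  2; break;      /* 011      */
        case 4:         factor = -2; break;      /* 100      */
        default:        factor = -1; break;      /* 101, 110 */
        }
        product += m * factor * ((int64_t)1 << i);  /* weighted partial product */
    }
    return product;
}

int main(void)
{
    printf("%lld\n", (long long)booth_multiply(-1234, 5678));  /* -7006652 */
    return 0;
}
```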
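Second, a sketch of residue arithmetic under the pairwise relatively prime moduli 7, 8, and 9 from the entries above; by the Chinese remainder theorem, every value from 1 through 504 (= 7 x 8 x 9) has a distinct triple of residues, and arithmetic can be done independently on each residue. The particular operands are arbitrary.

```c
#include <stdio.h>

int main(void)
{
    const int p[3] = {7, 8, 9};        /* pairwise relatively prime moduli */
    int a = 25, b = 17;

    int rsum[3];
    for (int i = 0; i < 3; i++) {
        int ra = a % p[i];             /* residues of a: 4, 1, 7 */
        int rb = b % p[i];             /* residues of b: 3, 1, 8 */
        rsum[i] = (ra + rb) % p[i];    /* addition is independent per modulus */
    }

    /* The residue triple of the sum matches (a+b) mod p[i] for each i. */
    for (int i = 0; i < 3; i++)
        printf("mod %d: computed %d, expected %d\n",
               p[i], rsum[i], (a + b) % p[i]);
    return 0;
}
```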
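Third, a sketch that pulls apart the IEEE 64-bit fields described in the floating point entry: [63:63] sign, [62:52] excess-1023 exponent, [51:0] fraction. The test value -6.25 is an arbitrary choice.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    double x = -6.25;                  /* -1.1001 x 2^2 in binary */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);    /* reinterpret the bits, no conversion */

    unsigned sign = (unsigned)(bits >> 63);
    int biased_exp = (int)((bits >> 52) & 0x7FF);   /* excess-1023 code */
    uint64_t fraction = bits & ((1ULL << 52) - 1);

    /* Normalized: (-1)^sign * 1.fraction * 2^(biased_exp - 1023).
       A biased_exp of 0 with a nonzero fraction would mean subnormal. */
    printf("sign=%u biased_exp=%d true_exp=%d fraction=0x%013llx\n",
           sign, biased_exp, biased_exp - 1023,
           (unsigned long long)fraction);
    return 0;
}
```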
underflow: an exception caused by a computed value being too small in magnitude to be expressed other than by 0.
overflow: an exception caused by a computed value being too big to be expressed by any normalized finite value. Rounding will result in the value being represented by infinity or by the largest exactly representable number, depending on the rounding mode used.
chopping: one of the four rounding modes. Rounding is accomplished by removing (chopping) the least significant bits. Both positive and negative values are rounded toward zero.
round up: one of the four rounding modes. Rounding is accomplished by rounding toward the next larger value. All values are rounded toward +infinity.
round down: one of the four rounding modes. Rounding is accomplished by rounding toward the next smaller value. All values are rounded toward -infinity.
round to unbiased nearest (beautiful rounding): one of the four rounding modes. Rounding is accomplished by rounding to the neighboring value (greater or smaller) that is closer in magnitude to the value being rounded. If equidistant, we round to the one (of the two) that has a 0 in its ULP. (The first sketch below exercises all four modes.)
error due to underflow: the amount of error caused by representing a value by 0.
error due to inexact: the amount of error caused by rounding.
gradual underflow: the result of introducing subnormal numbers into the representation scheme so that the error due to underflow equals the error due to rounding as values approach 0. Subnormal numbers are numbers too small to represent as normalized numbers. Consider the smallest normalized number; we call it 1.000...000 x 2^n. The binade delimited by 2^n and 2^(n+1) contains 2^k exact values, where k is the number of bits of fraction. This binade is exactly the same size as the interval between 0 and 2^n. The subnormal numbers are the values 0.<> x 2^n. Since the fraction consists of k bits, there are exactly 2^k exact values in this interval. Since this interval is the same size as the smallest binade, the error due to underflow is the same as the error due to rounding. Professor W. Kahan (the inventor of IEEE Floating Point Arithmetic) gave this notion the name "gradual underflow." (See the second sketch below.)
subnormal: see gradual underflow.
floating point exception: there are five things that can cause an exception: underflow, overflow, inexact, divide by 0, and invalid. Exceptions can be quiet, in which case a sticky bit is set that can be cleared only by testing for it (usually used for inexact), or signaling, in which case the exception is taken immediately (always used for invalid).
divide by 0 exception: an exception caused by a computation that uses finite arguments and produces an infinite result. Called "divide by 0" because the simplest example is a finite number divided by 0. Another example is the tangent function with an argument of 90 degrees.
invalid exception: an exception caused by an operation that produces nonsense results. Examples include 0/0, infinity/infinity, sqrt(-n), arcsin(2).
NaN (not a number): the result of an operation that creates nonsense. See invalid exception.
SIMD (data parallel): processing that involves a single instruction stream such that each instruction operates on multiple data sets. Two forms of SIMD processing exist: array processors and vector processors.
SPMD (single procedure multiple data set): a variation of SIMD wherein the level of synchronization is a procedure rather than each instruction.
SISD: a single instruction stream operating on a single data set.
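Here is a small sketch of the four rounding modes using the standard C99 <fenv.h> interface, where FE_TOWARDZERO corresponds to chopping and FE_TONEAREST to round to unbiased nearest. Since 1/3 is inexact in binary, each mode yields a different neighboring representable value.

```c
#include <fenv.h>
#include <stdio.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    struct { int mode; const char *name; } modes[] = {
        { FE_TOWARDZERO, "chop (toward zero)"        },
        { FE_UPWARD,     "round up (+infinity)"      },
        { FE_DOWNWARD,   "round down (-infinity)"    },
        { FE_TONEAREST,  "round to unbiased nearest" },
    };
    for (int i = 0; i < 4; i++) {
        fesetround(modes[i].mode);
        /* volatile keeps the division at run time, under the current mode */
        volatile double one = 1.0, three = 3.0;
        printf("%-26s 1/3 = %.20f\n", modes[i].name, one / three);
    }
    return 0;
}
```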
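And a companion sketch of gradual underflow and the quiet (sticky) exception flags, again via the standard <fenv.h>, <float.h>, and <math.h> interfaces; the printed constants are for the 64-bit format.

```c
#include <fenv.h>
#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* Gradual underflow: below the smallest normalized number DBL_MIN
       (1.0 x 2^-1022), the subnormals 0.fraction x 2^-1022 fill the gap
       down to 0 in steps of one ULP of that smallest binade. */
    printf("smallest normalized: %g\n", DBL_MIN);
    printf("smallest subnormal:  %g\n", nextafter(0.0, 1.0));

    /* Quiet exceptions: the flags stay set (sticky) until cleared. */
    feclearexcept(FE_ALL_EXCEPT);
    volatile double zero = 0.0;
    volatile double inf_result = 1.0 / zero;      /* finite/0 -> infinity */
    volatile double invalid_result = zero / zero; /* 0/0 -> NaN (invalid) */
    printf("1/0 = %g, 0/0 = %g\n", inf_result, invalid_result);
    printf("divide-by-zero raised: %d, invalid raised: %d\n",
           fetestexcept(FE_DIVBYZERO) != 0, fetestexcept(FE_INVALID) != 0);
    return 0;
}
```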
MIMD: multiple instruction streams, each operating on its own data set. The classic form of a multiprocessor or multicomputer.
vector processor: a processor containing vector instructions, i.e., instructions that operate, in turn, on multiple values.
array processor: a set of processors all executing the same program, each executing the identical instruction in lock step, each step of the way.
granularity of concurrency: the scope of what is being processed simultaneously. Granularity can be at the intra-instruction level, as in pipelining; at the inter-instruction level, as in VLIW; or at the processor level, as in a multiprocessor.
multiprocessor: more than one processor operating concurrently. The pure definition of a multiprocessor requires that each processor carry on substantial work by itself (to differentiate it from a co-processor) and that all the processors share the same memory space. This form of multiprocessor is sometimes referred to as a tightly coupled multiprocessor, as compared to a loosely coupled multiprocessor, where there is no shared memory. The more proper term for a loosely coupled multiprocessor is multicomputer network, but most people refer to the two paradigms as tightly and loosely coupled multiprocessors. In a shared memory multiprocessor, values can be passed from one processor to another via shared memory locations. In a loosely coupled multiprocessor, they are passed via message passing.
tightly coupled MP: see multiprocessor.
shared global memory: see multiprocessor.
loosely coupled MP: see multiprocessor.
message passing: see multiprocessor.
superscalar: a microarchitecture paradigm wherein more than one instruction is fetched and decoded each cycle, the exact number determined by the fetch/decode engine and dependent on the actual instructions fetched.
IPC (instructions per cycle): a common metric of performance. The number of instructions completing execution, on average, per cycle.
Flynn's bottleneck: Michael J. Flynn's observation that as long as one fetches instructions into the pipeline one per cycle, the machine can never obtain an IPC greater than 1.
superpipelined: a term used to describe a pipeline wherein the instruction fetch rate is a multiple of the rate at which instructions can be processed by one of the functional units in the pipeline.
scalability: a measure of the ability of a design to produce linearly improving benefits as the number of elements in the design continues to increase linearly. For example, if we add k ALUs, do we get k times the performance? If we increase the frequency by x, do we compute the answer x times faster?
speedup: a measure of the improvement in performance achieved by adding processing capability to a problem. In its pure form, Speedup(p) = Time(1)/Time(p), where Time(1) is the time a single processor requires to compute the answer using the best available algorithm for a single processor, and Time(p) is the time required to compute the answer using p processors.
baseline: in any measurement where we are trying to evaluate the benefit of a feature, the baseline is the system that does not include the feature. Note: some less-than-honest purveyors will quote benefits due to a feature by selecting a very wimpy baseline. In the case above which discusses Speedup with p processors, a poor baseline would be to use, on the single processor, an algorithm that is particularly bad when we are restricted to one processor. Thus we would have a large Time(1), which would make Time(p) look better than it really is.
Amdahl's Law: an observation by Gene Amdahl that most problems have a parallel part and a sequential part, such that the addition of p processors reduces the execution time of the parallel part by a factor of p but does nothing to improve the execution of the sequential part. We call the sequential part, for which the p processors can do nothing, the sequential bottleneck. If we let a = the fraction of the problem that is parallelizable, then we can state Amdahl's Law in terms of speedup: Speedup(p) = Time(1) / ((a x Time(1))/p + (1-a) x Time(1)) = 1 / (a/p + (1-a)). (A sketch follows this group of entries.)
sequential (serial) bottleneck: see Amdahl's Law.
parallelizable: see Amdahl's Law. The part that can be executed in parallel.
NUMA (non-uniform memory access): a distributed shared memory multiprocessor wherein processors can access all of memory, but the access time varies depending on which part of the distributed memory a processor is accessing.
memory contention: the result of more than one processor attempting to access the same memory bank at the same time.
cache coherence problem (cache consistency): in a shared memory multiprocessor, it is most common for each processor to have its own private cache. If more than one processor has the contents of the same memory location in its cache, it is important that those respective contents be identical. We say that caches are coherent when this is the case. Caches are not coherent if, for example, both P1 and P2 cache location A, but P1 has one value for A in its cache and P2 has a different value for A in its cache.
fault tolerance: the ability of a processor to compute correctly even though there are some faults in the hardware.
high availability: a more recent term for fault tolerance. Availability refers to the fraction of time a computer is processing correctly and not "down." A high availability machine is one that is "down" for no more than a very few minutes a year. To accomplish this, the computer system must be able to operate in the presence of faults; therefore, the term: fault tolerant.
TMR (triple modular redundancy): one mechanism for achieving fault tolerance. Three machines execute the program in lock step and continually vote on the results they produce. (A voter is sketched below.) The probability that two are both wrong and give the same wrong answer is sufficiently remote that processing continues correctly as long as two give the same answer. When one of the three gives a different answer from the other two, it is assumed to be broken. It is replaced with a known-good machine, and the system continues. The system is continually available as long as the other two machines give identical answers during the period when the broken machine is being replaced. We call the average time between faults in a processor the MTTF (mean time to failure), and the time required to replace the faulty machine with a good one the MTTR (mean time to repair). As long as MTTF is greater than MTTR, we should do just fine.
pair and spare: another scheme for maintaining error-free operation in the presence of single-processor failures. Two pairs of processors, all four executing. Each processor compares results with its partner. As long as all four give the same answer, we are fine. When the partners of one pair disagree, we enter vulnerable mode, where we use the other pair for computation while we fix the failing pair. Again, MTTF and MTTR tell us how good our scheme is.
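A minimal sketch of Amdahl's Law as stated above; Time(1) cancels out of the formula, leaving Speedup(p) = 1 / (a/p + (1-a)). The 0.95 parallel fraction below is an arbitrary example.

```c
#include <stdio.h>

double amdahl_speedup(double a, int p)   /* a = parallelizable fraction */
{
    return 1.0 / (a / p + (1.0 - a));
}

int main(void)
{
    /* With 95% of the work parallelizable, the 5% sequential bottleneck
       caps speedup at 1/0.05 = 20, no matter how many processors. */
    int procs[] = {1, 2, 8, 64, 1024};
    for (int i = 0; i < 5; i++)
        printf("p = %4d  speedup = %6.2f\n",
               procs[i], amdahl_speedup(0.95, procs[i]));
    return 0;
}
```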
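And a toy sketch of the TMR voter: accept any result on which at least two of the three machines agree, and identify the dissenter for replacement. The interface is hypothetical, purely for illustration.

```c
#include <stdio.h>

/* Returns the majority result; *faulty is set to the index of the
   dissenting machine, or -1 if all three agree. */
int tmr_vote(int r0, int r1, int r2, int *faulty)
{
    *faulty = -1;
    if (r0 == r1 && r1 == r2) return r0;       /* unanimous */
    if (r0 == r1) { *faulty = 2; return r0; }
    if (r0 == r2) { *faulty = 1; return r0; }
    if (r1 == r2) { *faulty = 0; return r1; }
    return r0;  /* no majority: a real system must signal a fatal fault */
}

int main(void)
{
    int faulty;
    int result = tmr_vote(42, 42, 41, &faulty);   /* machine 2 is wrong */
    printf("voted result = %d, replace machine %d\n", result, faulty);
    return 0;
}
```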
MTTR (mean time to repair): see TMR.
snoopy cache: a mechanism for maintaining cache coherence. Each cache monitors the accesses of other processors to memory, so that it knows whether a particular cache line is exclusive to itself, shared with other caches, or being written by another processor, in which case it (this cache) must either invalidate its copy or update it with the value being written by the other processor. (A toy sketch follows this group of entries.)
VLIW (very large instruction word): a paradigm wherein the fetch/decode mechanism fetches n instructions from the instruction memory each cycle and executes the n instructions concurrently. The value n is fixed for the machine. The set of n instructions fetched in one cycle, that is, the contents of one VLIW word, is specified by the compiler. In order for the n instructions to execute concurrently, they must all be independent with respect to flow dependencies. It is the job of the compiler to identify n instructions that can be packaged together as a single VLIW word.
lock step: a characteristic of a VLIW machine is that the n instructions are executed by n processing elements in lock step, that is, at the same time. Further, if one processing element finishes before the others, it cannot fetch its next instruction until all have finished; the next fetch is done at the same time for all n processing elements.
decoupled access/execute (DAE): a relaxation of the VLIW paradigm wherein the processing elements do not have to execute their instructions in lock step.
EPIC (explicitly parallel instruction computer): a modern variation of VLIW, wherein additional bits (called template bits) are provided to specify which instructions in a very long word can be operated in lock step, and which have inter-instruction dependencies and therefore cannot be.
template bits: see EPIC.
trace scheduling: a profiling technique used with VLIW machines to find branches that are highly skewed, so as to schedule some of the instructions in a VLIW word in violation of control dependencies. Since 98% taken (a highly skewed branch) is not the same as 100% taken, additional instructions have to be provided to prepare the machine to undo instructions that should not have been executed when the branch behaves according to the other 2%, and still additional instructions to actually do the undo when that "bad" branch behavior occurs. The benefit of this technique is that most of the time the branch behaves according to the 98% behavior, and so additional parallelism can be exploited. The idea is that the undo time is small and is more than compensated for by the extra parallelism available 98% of the time.
restricted data flow: my word for the data flow graph produced by generating it (a la Tomasulo) at run time. A compile-time data flow representation of a program would be an enormous data flow graph. Generation at run time restricts the data flow graph to only those instructions in the active window. The active window is the set of instructions that have been fetched but not yet retired.
active window: see restricted data flow.
dataflow node: an operation in a data flow graph.
fine-grain dataflow: a dataflow graph wherein each node performs a small-granularity operation, like a single multiply or a single add.
coarse-grain dataflow: a dataflow graph wherein each node performs a larger-granularity function, like perhaps an entire high-level assignment statement or an iteration of a loop body.
copy node: a node that takes a value as input and produces two copies of that value as output. A consequence of pure data flow as an abstraction, wherein each input value to a data flow node is consumed when the node executes.
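The following is a toy sketch of the snooping decision for a single cache line, using an invalidate-based, MSI-style simplification; real protocols have more states and bus actions, so treat this only as the monitor-and-downgrade idea from the snoopy cache entry above.

```c
#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } LineState;

/* Our cache observes a bus transaction for a line it may hold and
   decides what to do with its own copy. */
LineState snoop(LineState mine, int other_is_write)
{
    if (mine == INVALID) return INVALID;   /* we don't hold the line */
    if (other_is_write)  return INVALID;   /* another writer: invalidate ours */
    if (mine == MODIFIED) return SHARED;   /* another reader: supply data, downgrade */
    return SHARED;                         /* already shared: stay shared */
}

int main(void)
{
    LineState s = MODIFIED;
    s = snoop(s, 0);   /* another processor reads:  MODIFIED -> SHARED  */
    s = snoop(s, 1);   /* another processor writes: SHARED   -> INVALID */
    printf("final state = %d (0 = INVALID)\n", (int)s);
    return 0;
}
```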
tokens: another name for the values that are input to data flow nodes or the results that are output from data flow nodes.
firing rule: the rule for when a data flow node can be scheduled for execution. Two disciplines: safe vs. queued. (A sketch of the queued discipline follows this group of entries.)
queue dataflow: a node fires when all inputs have tokens. The outputs produced are queued, waiting for their subsequent use.
safe dataflow: a node fires when all inputs have tokens AND there is an available slot for the result. Pro: queues are not required. Con: the node is kept from firing until an output slot is available, even though it possibly could have fired earlier.
conditional node: a node having two inputs and two outputs, the outputs being labeled true and false. This node takes two tokens (a value x and a boolean value) and passes the value x along the output path corresponding to whether the boolean token is true or false.
relational node: a node having two inputs and one output, the output being a boolean. The node performs the relational operation on its two inputs and produces the output token TRUE if the relation is true, or FALSE if the relation is false.
barrier synch: a node having k inputs and k outputs, such that when all inputs have tokens, the node fires, producing on each output line the token that was on its corresponding input line.
register spill code: the part of an object program, generated by a compiler, that writes values back to memory because there are not enough registers in the microarchitecture to accommodate all the temporaries required by the high-level program.
live-in: a value that exists prior to the entity under consideration. With respect to a unit of code, a live-in is a value that is a source for one of the instructions in the unit of code.
live-out: a value that is produced in the unit of code and is required as a source by code subsequent to the unit of code.
in-order retirement: instructions complete their execution (as far as the software knows) in program order, that is, the same order that the program dictates.
re-order buffer (result buffer): a structure in the microarchitecture that holds results produced out of order, so that these results can still be returned to the software in program order.
multithreading: a processor paradigm that allows multiple threads to be in various stages of execution at the same time. A thread is another word for an instruction stream. The HEP (circa 1978) was the first machine to implement multithreading. Each cycle an instruction is fetched from a different thread in round-robin fashion. See my handout on the HEP.
simultaneous multithreading (SMT): a variation of multithreading in which the execution stage of the pipeline executes out of order. The result is that multiple functional units can be simultaneously executing instructions from multiple threads.
interconnection network: a structure that interconnects various elements (usually processors and units of memory) of a computer system.
hypercube (cosmic cube): an example of an interconnection network. The structure is a boolean hypercube. That is, if n is the number of dimensions, and each dimension has two values (0 and 1), there are 2^n connection points in the network. Each connection point is connected to n other connection points and is at most n hops from every other connection point. See my handout on interconnection networks. (A hop-count sketch also follows below.)
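Here is a minimal sketch of the queued firing discipline for a two-input add node, per the firing rule entries above: the node fires only when both inputs hold tokens, and firing consumes them. The Node representation is hypothetical.

```c
#include <stdio.h>

typedef struct {
    int has_token[2];   /* firing rule: both must be 1 */
    int token[2];
} Node;

/* Try to fire an add node; returns 1 and writes the output token if the
   firing rule is satisfied, else 0. Firing consumes the input tokens. */
int try_fire_add(Node *n, int *out)
{
    if (!n->has_token[0] || !n->has_token[1])
        return 0;
    *out = n->token[0] + n->token[1];
    n->has_token[0] = n->has_token[1] = 0;
    return 1;
}

int main(void)
{
    Node add = {{0, 0}, {0, 0}};
    int out;
    add.has_token[0] = 1; add.token[0] = 3;
    printf("fires with one token? %d\n", try_fire_add(&add, &out));  /* 0 */
    add.has_token[1] = 1; add.token[1] = 4;
    if (try_fire_add(&add, &out))
        printf("fired, output token = %d\n", out);                   /* 7 */
    return 0;
}
```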
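And a sketch of the hypercube distance property from the entry above: label the 2^n connection points with n-bit IDs; neighbors differ in exactly one bit, so the hop count between two points is the Hamming distance of their labels, which is at most n.

```c
#include <stdio.h>

/* Number of hops between two hypercube nodes = Hamming distance. */
int hops(unsigned src, unsigned dst)
{
    unsigned diff = src ^ dst;   /* dimensions in which the labels differ */
    int h = 0;
    while (diff) {
        h += diff & 1u;
        diff >>= 1;
    }
    return h;
}

int main(void)
{
    /* n = 4 dimensions: 16 nodes, 4 neighbors each, diameter 4. */
    printf("hops(0000, 1011) = %d\n", hops(0x0, 0xB));   /* 3 */
    printf("hops(0101, 0100) = %d\n", hops(0x5, 0x4));   /* 1 */
    return 0;
}
```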
bus: an interconnection network wherein every element is connected to the same wire. Only one transaction can occur on the bus at any point in time. This is the structure with worst-case contention. See my handout on interconnection networks.
full cross-bar switch: a set of buses such that multiple transactions can occur, one on each bus, at the same time. See my handout on interconnection networks.
tree: a connection network that forms a binary tree. See my handout on interconnection networks.
mesh: a connection network that forms a two-dimensional array. See my handout on interconnection networks.
contention: one of the three common metrics for the goodness of an interconnection network. Contention is a measure of how many independent transfers can be made on the network simultaneously.
cost: one of the three common metrics for the goodness of an interconnection network. Cost is a measure of the number of connection points in the network.
latency: one of the three common metrics for the goodness of an interconnection network. Latency is a measure of the number of hops through the network a transfer has to make from source to destination. (A mesh latency sketch follows below.)
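A last sketch, for the latency metric in a two-dimensional mesh: with links only between grid neighbors, the hop count between two elements is the Manhattan distance between their coordinates. Contrast the bus (one hop, but only one transaction at a time) and the full cross-bar (one hop, many simultaneous transactions, at much higher cost).

```c
#include <stdio.h>
#include <stdlib.h>

/* Hop count between (x1,y1) and (x2,y2) in a 2-D mesh = Manhattan distance. */
int mesh_hops(int x1, int y1, int x2, int y2)
{
    return abs(x1 - x2) + abs(y1 - y2);
}

int main(void)
{
    /* Opposite corners of an 8x8 mesh: worst-case latency of 14 hops. */
    printf("hops = %d\n", mesh_hops(0, 0, 7, 7));
    return 0;
}
```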