profiling: the act, usually as part of the compilation process, of executing a program on a sample input data set to get a feel for the behavior of branches and other run-time activities. The intent is to have the compiler generate a better object file as a result of this information. Whether profiling provides value or not is highly dependent on how closely the sample input data set resembles the actual input data set when the program is executed for real. We call this "closeness" representativeness.

representativeness (of data): see profiling.

compile-time (static): that which is done before the program is actually run. For example, a compile-time branch predictor makes its prediction based on information that is available before the program starts execution.

runtime (dynamic): that which is done while the program is running. For example, a run-time branch predictor makes its prediction based on information that is collected while the program is being executed. It is important to note that in the old days, there was a clear distinction between compile time and run time. That is, the program was compiled. Then the program was executed. Today, more and more research has gone into the notion that the program should be amenable to re-compiling while it is running. That is, once the hardware knows that a decision that was made at compile time was a bad decision, the hardware ought to be able to reinvoke the compiler with this information and generate new object code that takes advantage of this new information. This concept is still in its infancy.

levels of transformation: my term for the transformations that must take place in order to get a problem stated in some natural language like English, Spanish, Gujarati or Tagalog to be solved by the electrons that do the work. Problem --> algorithm --> high level or assembly language program --> object code in the ISA of the machine --> control signals that drive the microarchitecture --> logic --> circuits --> electrons.

algorithm: a step by step procedure, wherein each step is definite (precisely defined), effectively computable (able to be carried out by a computer), and wherein the procedure is guaranteed to terminate (finite). For a problem to be solved with a computer, one must be able to construct an algorithm to solve it.

ISA: Instruction Set Architecture. The specification of all aspects of the interface between the software and the hardware; for example, instruction formats, opcodes, data types, addressing modes, memory address space, memory addressability, etc. This allows the software to specify a sequence of operations for the hardware to carry out (the program) and for the hardware to know what each of those instructions is asking it (the hardware) to do. For example: x86, PowerPC, SPARC, MIPS, VAX, and Alpha are all ISAs.

microarchitecture: a particular data path and its control that carries out the specification of an ISA. For example, the x86 ISA has been implemented by Intel by the 8086, 386, Pentium, Pentium Pro, Pentium 4, Pentium M, etc., and by AMD by the Athlon, Opteron, etc.

opcode: the part of an instruction that specifies the operation(s) that the hardware is expected to perform in carrying out this instruction.

prefix property: the property that a bit pattern is a prefix of a longer bit pattern. For example, 1001 is a prefix of 1001000001111.
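To make the prefix check concrete, here is a minimal C sketch; representing bit patterns as strings of '0' and '1' characters is purely an illustration device, not anything prescribed by the glossary:

    #include <stdbool.h>
    #include <stdio.h>

    /* Returns true if bit pattern a is a prefix of bit pattern b.
       Patterns are strings of '0' and '1' characters for clarity. */
    static bool is_prefix(const char *a, const char *b)
    {
        while (*a != '\0') {
            if (*b == '\0' || *a != *b)
                return false;
            a++;
            b++;
        }
        return true;
    }

    int main(void)
    {
        /* The example from the entry above. */
        printf("%s\n", is_prefix("1001", "1001000001111")
                           ? "prefix" : "not a prefix");
        return 0;
    }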
addressing modes: mechanisms for obtaining operands required by an instruction. For example, the LDW instruction obtains the address of the memory location to be accessed by adding an offset to a Base Register. We say the addressing mode is Base+offset. The JSR instruction computes the starting address of the subroutine by adding an offset to the PC. We say the addressing mode is PC-relative.

data type: a representation of information for which the ISA provides instructions that operate on that representation. For example, in the LC-3b, the ADD instruction operates on values represented as 2's complement integers. We say the 2's complement integer is a data type. There is no floating point data type in the LC-3b ISA since there is no instruction in the LC-3b that operates on values represented as floating point numbers. Note that we can still operate on floating point representations by writing a subroutine to operate on these representations. For example, if the LC-3b did have a FLOATADD instruction, the LC-3b could compute a floating point add in ONE instruction. Since it does not, one could write a procedure that isolates the exponent and fractional part of each value, compares the exponents, shifts right the smaller fraction accordingly, does the add, and renormalizes the result. However, such a procedure would take tens of LC-3b instructions.

addressability: the number of bits stored at each memory location. Most modern computers are byte addressable. That is, 8 bits are stored at each location.

byte-addressable: see addressability.

address space: the number of distinct memory addresses. That is, the maximum number of memory locations possible such that each can be specified by a unique address. Older machines, the original x86, for example, had an address space of 2^16. We sometimes say: the x86 had 16 bits of address space. The original IBM 360, I believe, had an address space of 2^24. Most ISAs today have an address space of 2^32, and some have an address space of 2^64.

endianness (big-endian/little-endian): Suppose a 32-bit value is to be stored in a memory having byte-addressability. Four memory locations are required. We store the 32-bit value in four consecutive memory locations, say those having addresses A, A+1, A+2, and A+3. The question still remains: which 8 bits of the value get stored in which location. If we store bits 31:24 into location A, 23:16 into A+1, etc., we say the memory system is big-endian. If we store bits 7:0 into location A, 15:8 into A+1, etc., we say the memory system is little-endian. That is, if the number of bits required to represent a value is a multiple of the addressability of the memory, we store the value in contiguous locations of memory, starting with the low-numbered address. If we store the most significant bits at that address, we say we are storing according to the big-endian convention. If we store the least significant bits there, we say we are storing according to the little-endian convention. (A code sketch follows this group of entries.)

fixed length vs. variable length: used to describe the number of bits used to encode an entity, usually used to refer to instructions, although data can also be fixed or variable length. If fixed length, the entity is always the same size. If variable length, the size is determined by the specific entity involved. LC-3b instructions are fixed length; every one consists of 16 bits. x86 instructions are variable length. Depending on the instruction, there may be anywhere from one to 15 bytes constituting a single instruction.
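A minimal C sketch of the endianness entry above, storing the same 32-bit value into four consecutive byte locations under each convention; the function names and the test value 0x12345678 are made up for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* mem[0..3] play the role of addresses A, A+1, A+2, A+3. */
    static void store_big_endian(uint8_t *mem, uint32_t v)
    {
        mem[0] = (v >> 24) & 0xFF;   /* bits 31:24 at address A   */
        mem[1] = (v >> 16) & 0xFF;   /* bits 23:16 at address A+1 */
        mem[2] = (v >> 8)  & 0xFF;   /* bits 15:8  at address A+2 */
        mem[3] = v         & 0xFF;   /* bits 7:0   at address A+3 */
    }

    static void store_little_endian(uint8_t *mem, uint32_t v)
    {
        mem[0] = v         & 0xFF;   /* bits 7:0   at address A   */
        mem[1] = (v >> 8)  & 0xFF;   /* bits 15:8  at address A+1 */
        mem[2] = (v >> 16) & 0xFF;   /* bits 23:16 at address A+2 */
        mem[3] = (v >> 24) & 0xFF;   /* bits 31:24 at address A+3 */
    }

    int main(void)
    {
        uint8_t be[4], le[4];
        store_big_endian(be, 0x12345678);
        store_little_endian(le, 0x12345678);
        printf("big-endian:    %02X %02X %02X %02X\n", be[0], be[1], be[2], be[3]);
        printf("little-endian: %02X %02X %02X %02X\n", le[0], le[1], le[2], le[3]);
        return 0;
    }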
uniform vs. non-uniform decode: fixed length instructions may have the property that the various fields of the instruction always occur in the same place, simplifying the decode process. We call such an encoding of the instruction format uniform decode. The LC-3b, for example, has the opcode ALWAYS in bits 15:12 and the result operand always in bits 11:9.

memory-mapped I/O: I/O activity can be specified by special instructions in the ISA, or by using the load and store instructions with each I/O device register assigned an address from the memory address space. The latter is referred to as memory-mapped I/O. For example, with memory-mapped I/O, keyboard input is specified by a load instruction where the address is the one assigned to the keyboard data register.

operate instructions: instructions that operate on data: ADD, SUB, XOR, SHR, etc.

logically complete: a set of logical primitives that allows one to specify all logical functions. NAND, for example, is by itself logically complete. AND is not by itself logically complete. The set {AND, NOT}, however, is logically complete.

arithmetic/logical shift: A right shift is conveniently used to perform an integer divide by 2, where the value being divided is expressed in 2's complement representation. Thus, we need to insert into the high bit position the sign bit of the value. On the other hand, bit patterns are often right-shifted for other (logical) purposes, in which case we need to insert a 0 into the high (left-most) bit position. Given these two uses, we need to differentiate an arithmetic shift, where we insert the sign bit into the left-most position, from a logical shift, where we insert a 0 into the left-most position. (A code sketch follows this group of entries.) [Note that an arithmetic left-shift, corresponding to multiply by 2, requires inserting a 0 into the right-most bit position, exactly the same as what we would do for a logical left-shift. Therefore, we generally do not differentiate arithmetic left shift from logical left shift.]

2-pass assembler: A simple assembler that makes two passes at an assembly language program, first to identify all the symbols and generate the symbol table, and second, to generate the actual bit patterns of the code.

pseudo-op: a directive provided by the assembly language programmer to the assembler to help the assembler translate the assembly language program to the 0s and 1s of the machine's ISA. In LC-3b assembly language, .ORIG, .FILL, and .STRINGZ are examples of pseudo-ops. Pseudo-ops are not executable at run-time. They are strictly messages to help the assembler assemble the program.

label: a symbol that identifies the location of an instruction. Necessary if that location contains the target of a branch or a location that contains data that will be accessed by a load or a store instruction.

data movement instructions: instructions that move data. Loads/stores to/from memory and I/O instructions.

control instructions: instructions that redirect the instruction stream. The default in instruction processing is to increment the PC to point to the next instruction. If the next instruction is other than the next sequential instruction, a control instruction is needed to provide the new PC. The LC-3b control instructions are BR, JMP, JSR(R), TRAP, and RTI.
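A minimal C sketch of the arithmetic/logical shift distinction above, written for 16-bit values in the spirit of the LC-3b; the function names are made up, and the sign replication is done explicitly rather than relying on C's right shift of signed values, whose behavior is implementation-defined:

    #include <stdint.h>
    #include <stdio.h>

    /* Logical right shift: insert 0s into the left-most bit positions. */
    static uint16_t shift_right_logical(uint16_t v, int amount)
    {
        return v >> amount;   /* unsigned shift already inserts 0s */
    }

    /* Arithmetic right shift: replicate the sign bit (bit 15) into the
       vacated left-most bit positions, preserving the 2's complement sign. */
    static uint16_t shift_right_arithmetic(uint16_t v, int amount)
    {
        uint16_t result = v >> amount;
        if ((v & 0x8000) && amount > 0)
            result |= (uint16_t)(0xFFFF << (16 - amount));
        return result;
    }

    int main(void)
    {
        uint16_t minus8 = 0xFFF8;   /* -8 in 16-bit 2's complement */
        /* Arithmetic: 0xFFFC (-4, a correct divide by 2). */
        printf("arithmetic: 0x%04X\n", (unsigned)shift_right_arithmetic(minus8, 1));
        /* Logical: 0x7FFC (a large positive bit pattern). */
        printf("logical:    0x%04X\n", (unsigned)shift_right_logical(minus8, 1));
        return 0;
    }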
condition codes: single-bit registers (often referred to as "flags") which provide additional information about the result of the last instruction. The LC-3b has three condition codes (N,Z,P). Upon completion of an operate instruction, the condition codes specify whether the result of that operate was negative (N=1,Z=0,P=0), zero (N=0,Z=1,P=0), or positive (N=0,Z=0,P=1). Similarly, upon completion of a load, the condition codes specify whether the value loaded into the destination register was negative, zero, or positive. Most real ISAs also have C and V condition codes. C designates the carry out of an add instruction. V designates whether the previous integer arithmetic instruction resulted in an overflow condition. One example is the case where the source operands of an ADD are both positive 2's complement numbers, but the result produced by the ALU is the representation of a negative number. [Say I have 4-bit integers. +6 = 0110, +5 = 0101. If I tell the ALU to add them, the output would be 1011, which is the representation for -5. The ALU is saying +6 plus +5 = -5. Ergo, set V.] (A code sketch follows this group of entries.)

0, 1, 2, 3 address machine: An n address machine is an ISA where n indicates the number of operands that need to be specified EXPLICITLY for binary operations in the ISA. The ADD is a good example of a binary operate instruction. In a 0 address machine, none need to be specified, since the microarchitecture obtains both sources from the top of the stack, and pushes the result onto the stack. In a 3 address machine, like the LC-3b, for example, both sources and the destination ALL need to be specified explicitly (ADD R1,R2,R3). The x86 is a 2 address machine.

load/store ISA: an ISA where operate instructions can only operate on operands that are stored in registers. That is, one cannot load a value from memory and operate on that value in the SAME instruction, for example. The LC-3b is an example of a load/store ISA. The x86 is not.

fetch: the process of accessing an instruction from the instruction memory.

hardware interlock: a hardware signal that prevents improper behavior from occurring in the execution of instructions. For example, consider the pipelined processing of a MUL that stores to R3, immediately followed by an ADD that sources R3. At the time the ADD wishes to source R3, the value stored there is the old value, not the value that will be produced in a few cycles by the MUL instruction. A hardware interlock (called a scoreboard bit) prevents the ADD from sourcing R3 until after the MUL stores its result there.

semantic gap: an old term, used to describe the closeness of fit between the instructions in a high level language and the opcodes in an ISA. For example, the VAX had an opcode AOBLEQ which added 1 to an operand, and branched if the result of the add was less than or equal to some limit operand. This instruction was clearly in the ISA to accommodate "for" loops. The semantic gap between the "for" and AOBLEQ was tiny. In the old days, there was strong interest in specifying opcodes that had a small semantic gap relative to the high level languages commonly in use at that time. That interest has waned during the past 20 years, but there is good reason why we may see it return in the future.

microsequencer: the mechanism in the microarchitecture that figures out what control signals will be needed during the next clock cycle.

control store: a private hardware structure for storing the set of control signals needed in a particular clock cycle. In a microprogrammed machine, the job of the microsequencer is to identify the location in the control store that contains the control signals for the next clock cycle.
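A minimal C sketch of the condition codes entry above, reproducing its 4-bit overflow example; the function name and the 4-bit width are illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    /* Add two 4-bit 2's complement values and report the N, Z, P
       condition codes and the overflow flag V. */
    static void add4(uint8_t a, uint8_t b)
    {
        uint8_t sum = (a + b) & 0xF;        /* keep only 4 bits */
        int n = (sum >> 3) & 1;             /* negative if bit 3 is set */
        int z = (sum == 0);
        int p = !n && !z;
        /* V: the sources have the same sign, but the result's sign differs. */
        int v = (((a ^ b) & 0x8) == 0) && (((a ^ sum) & 0x8) != 0);
        printf("sum=0x%X  N=%d Z=%d P=%d V=%d\n", (unsigned)sum, n, z, p, v);
    }

    int main(void)
    {
        add4(0x6, 0x5);   /* +6 + +5 = 1011 (-5): prints N=1, V=1, as in the text */
        return 0;
    }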
microinstruction: the set of control signals needed in one clock cycle. In a microarchitecture with a control store: the contents of one location in the control store.

datapath: the part of the microarchitecture that processes the information. The register file, ALU, and various muxes are all part of the datapath.

state machine: the structure that specifies, for each state of the machine, and each set of relevant values (opcode, interrupt present, memory ready, etc.), what set of control signals should be invoked during the next clock cycle.

VLIW: Very Long Instruction Word. A processing paradigm wherein the compiler generates a long word, consisting of more than one instruction, all stored at the same PC address. At instruction fetch, all instructions in that long word are fetched, decoded, and executed concurrently in lock step. To work, it is necessary that there be no flow dependencies among the instructions in a single long word. This paradigm provides benefit if the compiler can find at compile time multiple instructions that have no flow dependencies, and therefore can execute concurrently. The number of instructions n in the long word (VLIW instruction) is fixed. Therefore, if the compiler cannot find n independent instructions to put into one VLIW instruction, it must fill the remaining slots with no-ops.

compatibility: a term used to describe the ability of new implementations of an ISA to execute code written for an older implementation. Intel has been a master at making sure that every new x86 implementation is able to execute correctly all code written for all older x86 implementations. As time passes, it becomes clear that some old opcodes, addressing modes, and data types are no longer useful, and some new opcodes, addressing modes, and data types are necessary. In fact, the addition of the SSE1 and SSE2 instruction sets to the x86 ISA is a reflection of the new uses of that ISA. However, Intel's insistence on compatibility means that in addition to the new SSE extensions, new implementations also have to include in their microarchitecture the capability to execute the old opcodes, addressing modes, data types, etc., since some customers may want to buy a new laptop but still use their old software.

pipelining: the process of carrying out instructions in an assembly line fashion. Suppose instructions are fetched one per cycle. Then in a pipelined implementation, in one cycle the fetch logic fetches an instruction, and hands it off to the decode unit to decode in the next cycle. In that next cycle, the fetch logic will be fetching the next instruction.

interrupt: an external event more urgent than the program running that causes the machine to stop execution of the program that is running and turn its attention to servicing the external event. A keyboard interrupt is an example of a low priority interrupt. A machine check is an example of a high priority interrupt.

exception: an event internal to the program that is running that must be handled to properly deal with the instruction. A page fault and an access control violation are two examples of exceptions.
precise exceptions: if an instruction causes an exception, it is often important to recover the machine state that corresponds to the situation where all instructions preceding the exception-causing instruction have executed correctly and no instruction following the exception-causing instruction has executed. It is also useful to identify the instruction that caused the exception. A microarchitecture that can do both is said to support precise exceptions. The Tomasulo Algorithm running on the IBM 360/91 did not support precise exceptions.

critical path design: the cycle time of a processor is one of the key items that specifies the performance of the processor. Recall that performance is, to a first approximation, 1/(instruction count * CPI * cycle time). Decreasing cycle time has a major impact on performance. Cycle time is determined by the longest of the various logic paths in the data path and its control. Critical path design instructs one to examine the longest path in the machine and attempt to reduce it, either by more clever logic design or by latching a path in the middle and turning a long single-cycle operation into a two-cycle operation. (A worked example follows this group of entries.)

bread and butter design: neither the supply of transistors on the chip nor the design time of the designer is infinite. With finite resources, bread and butter design suggests applying both to the chip's bread and butter, that is, those opcodes, addressing modes, and data types that are important to optimize for the expected usage of the chip.

balanced design: the whole point of a pipelined implementation is to allow the various parts of the microarchitecture to do their respective things concurrently. It makes no sense to provide capability in one stage of the pipeline that goes unused in the next stage. Better to use the extra transistors required to provide that capability for something more useful. Example: a fetch engine that can fetch and decode 6 instructions each cycle, but only has a single ALU for executing these instructions, is not a balanced design. There is no point wasting resources to be able to fetch and decode 6 instructions concurrently if the instructions just sit in a queue downstream waiting for the single ALU.

SRAM: Static random access memory. An SRAM cell stores a single bit of information with two cross-coupled inverters. The feedback path maintains the state of the bit. As long as power is supplied to the two inverters, the SRAM cell maintains its value. The SRAM cell is also referred to as volatile memory, since if power is removed from the inverters, the bit that is stored is lost. [Cross-reference: non-volatile, where the bits retain their values even when power is removed, as in a disk or tape.] The minimum number of transistors for an SRAM cell is 4 -- two inverters, with 2 transistors per inverter. However, most SRAM cells use more transistors to provide for other features, such as switching speed, gating, and noise immunity.

DRAM: Dynamic random access memory. A DRAM cell stores a single bit of information as the absence or presence of charge on a capacitor. The capacitor charged is one value; the capacitor uncharged is the other value. A single transistor is required to read (sample the charge) and write (store the charge) the bit. Unfortunately, the capacitor will not maintain its charge indefinitely. There are RC paths to ground that will leak the charge. Therefore, a DRAM cell requires periodic refresh. That is, periodically, capacitors must be sampled and returned to their fully charged or uncharged state.

refresh: see DRAM.
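A worked instance of the performance expression from the critical path design entry above; the numbers are made up purely for illustration:

    \[
    \text{execution time} = \text{instruction count} \times \text{CPI} \times \text{cycle time},
    \qquad
    \text{performance} = \frac{1}{\text{execution time}}
    \]

For instance, a program of 10^9 dynamic instructions at CPI = 2 on a 1 ns clock takes 10^9 * 2 * 1 ns = 2 seconds. If shortening the critical path cuts the cycle time to 0.8 ns with CPI unchanged, execution time drops to 1.6 seconds, a speedup of 2/1.6 = 1.25.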
page mode: Typically DRAM chips are addressed in two steps. Consider the DRAM chip as containing 2^n addresses. The high order n/2 bits (row) come in first, using RAS. This effectively preselects the contents of all addresses having the same row address bits. Then the low order n/2 bits (column) of the address come in, using CAS. This selects the exact address from the 2^(n/2) that have identical row address bits. Access time is the sum of RAS + CAS times. If the next address has the same row address bits as the previous one, the memory controller, which keeps track of this, only needs to send the column address bits to the DRAM chip, thereby saving the time needed to latch the row address bits. We say that all addresses having the same row address bits are on the same "page," and the act of the memory controller exploiting this by only sending subsequent column address bits we refer to as "page mode."

row address strobe (RAS): see page mode.

column address strobe (CAS): see page mode.

chip enable: output pins of a chip float and have no effect if the chip enable bit is not asserted.

parity: a coding scheme that requires only one extra bit to detect a single bit error, provided that all single bit errors are statistically independent. "Even" parity dictates that this extra bit must be set to 1 or 0 such that the entire n+1 bit transfer contains an even number of 1s. In that way, if the received n+1 bits contain a single error, the number of 1s will be odd, so the fact that an error has occurred will be detected. Statistical independence of the bit errors is necessary, since if two bits are in error, parity will not detect it. (A code sketch follows this group of entries.)

ECC: error correction code. A code which allows a single bit error to be not only detected, but also corrected without requiring the bits to be retransmitted. See my handout on the subject.

Hamming Code: the mechanism for performing ECC correction. See my handout.

CRC check: cyclic redundancy check. A mechanism for detecting burst errors, wherein the individual bit errors are not statistically independent.

memory bank: a unit of memory that can process one memory access at a time.

interleaving: allocating addresses to memory locations "horizontally," rather than "vertically." That is, if address n is in bank 0, address n+k is in bank 1, address n+2k is in bank 2, address n+3k is in bank 3, etc., where k is the size of the access in a bank in bytes (assuming byte-addressable memory).

unaligned access: an access to memory that is not aligned on an access boundary, thereby requiring two accesses to perform the desired operation. See my handout on unaligned accesses.

working set: the set of pages belonging to a process that are resident in memory at an instant of time is called the process' working set at that instant of time.

balance set: the set of processes that have their working sets in memory.

swap in/swap out: the act of bringing in or out the working set of a process. If the working set of a process is stored back to disk, we say the process is swapped out.

page: the unit of virtual memory that is transferred between disk and physical memory. That is, the whole page is brought in or out, or none of it.

frame: the physical space that a virtual page occupies in physical memory.

resident: a term used to describe the fact that a virtual page is actually occupying a frame of physical memory.

page in/page out: the act of moving a virtual page from the disk to physical memory (page in), or from physical memory to the disk (page out).

thrashing: the condition whereby there are not enough frames of physical memory to hold all the virtual pages that a process needs, resulting in an inordinate amount of time being wasted by pages coming in and going out, only to come in again, etc.
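A minimal C sketch of the even parity entry above; the function names and the example byte are made up for illustration:

    #include <stdint.h>
    #include <stdio.h>

    /* Sender side: choose the parity bit so that the 9 transmitted bits
       (8 data bits plus parity) contain an even number of 1s. */
    static int even_parity_bit(uint8_t data)
    {
        int ones = 0;
        for (int i = 0; i < 8; i++)
            ones += (data >> i) & 1;
        return ones & 1;   /* 1 if the data alone has an odd number of 1s */
    }

    /* Receiver side: an error is detected if the received data plus
       parity bit contain an odd number of 1s. */
    static int parity_error(uint8_t data, int parity)
    {
        return even_parity_bit(data) != parity;
    }

    int main(void)
    {
        uint8_t sent = 0xB5;              /* 1011 0101: five 1s, so parity = 1 */
        int p = even_parity_bit(sent);
        uint8_t received = sent ^ 0x04;   /* a single bit flips in transit */
        printf("error detected: %d\n", parity_error(received, p));  /* prints 1 */
        return 0;
    }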
PTE: page table entry. A descriptor for each page of virtual memory, containing information including the frame of physical memory where it resides, whether it has been modified since it was loaded from the disk, and protection information as to what levels of privilege are required to access the information stored on that page. Some ISAs include other information in each page's PTE.

page table: a data structure containing the PTEs of all pages in the virtual image of the process.

mapping virtual to physical addresses: the transformation of a virtual address to the physical location in memory where the information at that virtual address is actually stored.

PSR: processor status register. A processor register that contains status information about the running process, including its privilege level and priority.

privilege level: every ISA defines a set of privilege levels, from unprivileged to most privileged, to distinguish what level of privilege is required to make certain accesses. The simplest model has two levels of privilege: privileged and unprivileged.

protection: the concept of protecting a computer resource from being utilized by a process that does not have the required level of privilege. For example, a page can be protected so that only the highest privileged processes can write to that page. Another page may be protected so that any process can write to it. Some instructions can only be executed by processes that have a certain level of privilege. For example, on machines that have a HALT instruction, unprivileged processes are generally not allowed to execute it, since that would allow any process to stop the computer.

virtual memory: memory that a process thinks it has. Generally, the size of the virtual address space.

physical memory: memory that actually exists.

address translation: conversion of a virtual address to the physical address where the virtual location actually exists, so that the processor can read/write values into that location. (A code sketch follows this group of entries.)

VPN: virtual page number. The high bits of a virtual address; they specify the page number.

PFN: page frame number. The high bits of a physical address; they specify the frame number within physical memory.

TLB: translation lookaside buffer. It is a cache of PTEs. It allows translation to take place without going through the involved process of "walking" through the page tables. That is, within a fraction of a cycle instead of tens of cycles.

valid bit: a bit determining whether other bits are accurate. In the case of the PTE, the valid bit tells whether the hardware can believe the PFN bits, which really says whether the page is resident or not. If V=0, the page is not resident, forcing the hardware to take a page fault.

reference bit: a mechanism used for page replacement. If the reference bit mechanism is used for page replacement, each PTE contains a reference bit. Periodically, all reference bits in the page table are set to 1. When a page is accessed, its reference bit is cleared. When a page fault occurs, the page chosen to be kicked out to make room is taken from the set of pages whose reference bits are still 1.

modify bit: a bit in the PTE which is set when a page is written to. If the modify bit is clear, the page need not be written back to disk on replacement since it has not changed since being read from the disk.

page table base register: the starting address of the page table. Used to compute the address of a PTE.
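A minimal C sketch tying together several of the entries above (VPN, PFN, PTE, valid bit, and the limit register defined next); the 4 KB page size, the PTE layout, and the flat single-level page table are assumptions chosen purely for illustration, not the organization of any particular ISA:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_BITS 12                   /* assumed: 4 KB pages */
    #define PAGE_SIZE (1u << PAGE_BITS)

    typedef struct {
        unsigned valid  : 1;   /* can we believe the PFN (page resident)? */
        unsigned modify : 1;   /* written since loaded from disk?         */
        uint32_t pfn;          /* page frame number                       */
    } pte_t;

    /* Translate a virtual address using a flat page table whose length
       is given by the limit register. */
    static uint32_t translate(uint32_t va, const pte_t *page_table, uint32_t limit)
    {
        uint32_t vpn    = va >> PAGE_BITS;        /* high bits: page number   */
        uint32_t offset = va & (PAGE_SIZE - 1);   /* low bits: within page    */

        if (vpn >= limit) {                       /* ill-formed address       */
            fprintf(stderr, "ACV: limit violation\n");
            exit(1);
        }
        if (!page_table[vpn].valid) {             /* not resident             */
            fprintf(stderr, "page fault (TNV)\n");
            exit(1);
        }
        return (page_table[vpn].pfn << PAGE_BITS) | offset;
    }

    int main(void)
    {
        pte_t pt[4] = { {1, 0, 7}, {0, 0, 0}, {1, 1, 5}, {1, 0, 9} };
        /* VPN 2 -> PFN 5: prints PA = 0x00005ABC. */
        printf("PA = 0x%08X\n", (unsigned)translate(0x00002ABC, pt, 4));
        return 0;
    }

A TLB, as defined above, is just a cache of the (vpn, PTE) pairs this function would otherwise look up in memory.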
limit register: the number of PTEs in the page table. Used to determine if a virtual address actually corresponds to an address in the virtual image.

system space: virtual address space that is common to all processes, and is used to house the data structures and code of system related functions.

user space: virtual address space specific to a process. Sometimes called process space. Each process has its own version of process space.

access control violation (ACV): an exception caused by one of two things. Either the address was ill-formed and does not belong to the virtual address space of that process (limit violation), or the privilege of the process does not allow the access specified (protection violation).

page fault (TNV): translation not valid. The virtual address being translated is not resident in physical memory.

segmentation: a protection mechanism to protect regions of memory for different kinds of accesses.

segment register: a register used in conjunction with the address provided by the program to produce an address which is either a physical address that can be accessed directly or a virtual address which needs to be further translated to a physical address.

linear address: Intel's name for virtual address. That is, a segment register provides a base address which is combined with the address provided in the program to form the linear address. The linear address is further translated to a physical address using PTEs.

cache memory: a small, fast-access memory structure that stores a subset of the physical memory, thereby reducing memory access time if the desired location is present in the smaller structure. Since the cache only stores a subset of the physical memory, it must store both the addresses of the locations it contains as well as the data at those locations. The granularity of a cache entry is called a cache line (or synonymously, a cache block).

spatial locality: the property that if a location in memory is accessed, it is likely that locations close by will be accessed soon.

temporal locality: the property that if a location in memory is accessed, it is likely that location will be accessed again soon.

cache line (block): the granularity of storage; see cache memory. This size is generally greater than the size of a single access, due to our faith that spatial locality exists. A memory address is partitioned into three parts: its tag bits, its index bits, and its offset bits. The offset bits define which location within the cache line is being accessed. The index bits are used to index into the tag store, and specify the set of locations in the cache available for storing the particular line. The tag bits are stored at the corresponding index in the tag store to verify that the actual line stored in the cache is the one desired. That is, all blocks/lines from memory having identical index bits in their addresses compete for the same entries in the cache. The tag bits are used to specify which of these many lines are actually stored in the cache. (A code sketch follows this group of entries.)

direct-mapped cache: a cache characterized by the property that if a line of memory is stored in the cache, it can only be stored in one place.

fully-associative cache: a cache characterized by the property that if a line of memory is stored in the cache, it can be stored anywhere in the cache.

set-associative cache: a cache characterized by the property that a line of memory can be stored in any of k entries in the cache. We say such a cache is k-way set-associative, and the set size is k.
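A minimal C sketch of the tag/index/offset partitioning described in the cache line entry above, assuming (purely for illustration) a 64 KB direct-mapped cache with 32-byte lines and 32-bit addresses:

    #include <stdint.h>
    #include <stdio.h>

    /* 32-byte lines: offset bits = log2(32) = 5.
       64 KB / 32 B   = 2048 lines: index bits = log2(2048) = 11.
       Tag bits = 32 - 11 - 5 = 16. */
    #define OFFSET_BITS 5
    #define INDEX_BITS  11

    static void split_address(uint32_t addr)
    {
        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("addr=0x%08X  tag=0x%04X  index=%u  offset=%u\n",
               (unsigned)addr, (unsigned)tag, (unsigned)index, (unsigned)offset);
    }

    int main(void)
    {
        split_address(0x12345678);
        split_address(0x12345679);   /* same line: same tag and index */
        return 0;
    }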
tag/data store: The tag store is the part of the cache that keeps all the bookkeeping information associated with the blocks of memory that are actually stored in the cache. Most important of these is the identity of the actual block. The data store is the storage in the cache allocated for storing the actual information that is contained in the corresponding lines of memory. When we say a cache is of a certain size, we are referring to the size of the data store. [A system having 16MB of memory and a 64KB cache stores 64KB out of the 16MB in the cache's data store. We do not say how much tag store is required. The tag store is just management overhead to allow the cache to function properly.]

cache hit/miss: on a load or a store, the cache is queried to determine if the location's information is contained in the cache. If yes, we say it is a cache hit. If no, we say it is a cache miss.

compulsory miss, conflict miss, capacity miss: the three common causes of a cache miss. A compulsory miss occurs the first time the line is accessed, unless the line has been prefetched into the cache. There is no other way the line could have gotten into the cache. A capacity miss occurs if a line that used to be in the cache had been kicked out simply because there is not enough room in the cache. A conflict miss occurs if a line was kicked out, not because of the size of the cache, but rather because of the size of the set. That is, the cache was not full, but the set for which the line competes was full. Fully associative caches do not have conflict misses.

hit ratio: the percentage of accesses that are hits. That is, hit ratio = hits/(hits + misses).

writeback/write through: a property of caches. A write through cache is one in which every write to the cache is also written to memory. In this case, memory and the cache are always consistent. On replacing a line, it is unnecessary to write that line back to memory. A write back cache is one in which writes to the cache do NOT also include writes to memory. Therefore, if a cache line is to be replaced, its dirty bit is checked to see whether any writes to that line have occurred since it was brought in from memory. If the dirty bit is set, indicating that writes have occurred, then the line has to be written back to memory before it can be removed from the cache.

dirty bit: in a write back cache, a bit that specifies whether the information in the cache line has been changed since it was read in from memory. If not, then on replacement, this line need not be written back to memory.

allocate on write miss: the act of setting aside space in the cache for data to be stored, if the line written to is not in the cache.

sector cache: the characteristic of a cache whereby each cache line is partitioned into sectors, each of which is individually valid. Thus, if a cache line is partitioned into four sectors, there are four valid bits in the tag store entry for that line. Useful in those cases where one wishes to allocate on a write miss, but not read in the entire line from memory because it is likely that the entire line will be completely overwritten before it is ever read.

replacement policy (LRU, FIFO, random, victim/next-victim): algorithm to determine which line within a set is to be evicted if a cache miss occurs, and one of the lines in the set must go. The line to be evicted is often called the victim. The line that will become the victim after the victim is evicted is the next-victim. (A code sketch follows this group of entries.)

key transformation: a fancy synonym for hashing.
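A minimal C sketch of LRU replacement within one k-way set, as described in the replacement policy entry above; the timestamp approach and all names are illustrative only, since real designs often approximate LRU with cheaper mechanisms:

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4

    typedef struct {
        int      valid;
        unsigned tag;
        uint64_t last_used;   /* timestamp of most recent access */
    } line_t;

    static uint64_t now = 0;

    /* Returns the way that hit, or fills the LRU victim and returns its way. */
    static int access_set(line_t set[WAYS], unsigned tag)
    {
        now++;
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == tag) {   /* hit: refresh */
                set[w].last_used = now;
                return w;
            }
        }
        /* Miss: prefer an invalid way; otherwise evict the least
           recently used line (the victim). */
        int victim = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!set[w].valid) { victim = w; break; }
            if (set[w].last_used < set[victim].last_used)
                victim = w;
        }
        set[victim].valid = 1;
        set[victim].tag = tag;
        set[victim].last_used = now;
        return victim;
    }

    int main(void)
    {
        line_t set[WAYS] = {0};
        unsigned tags[] = { 1, 2, 3, 4, 1, 5 };   /* 5 evicts 2, the LRU line */
        for (int i = 0; i < 6; i++)
            printf("tag %u -> way %d\n", tags[i], access_set(set, tags[i]));
        return 0;
    }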
virtually-indexed, virtually-tagged cache: a virtually addressed cache.

physically-indexed, physically-tagged cache: a physically addressed cache.

virtually-indexed, physically-tagged cache: a cache wherein the index bits used to access the tag store are obtained from the virtual address, but the tag bits actually stored in the tag store are physical address bits. This allows TLB and tag store accesses to be made concurrently.

cold start effect: in general, the transient effect that occurs in most things before a system has been in service long enough to be operating in a stable mode. In the case of a cache, this refers to the case where the cache is completely empty and the first several accesses are compulsory misses.

RAM: random access memory. Given an access, the time to perform the next access is independent of the location of that access relative to the preceding access.

direct access storage: access is random to a starting address, and then sequential locations are accessed from that point on. The most common type is a disk.

sequential access storage: access is always sequential to the next location. The most common type is a tape.

associative access: access is based at least in part on the contents of the storage location. That is, part of the address is actually stored in the storage location. Two common examples that you have already seen of this are (1) the tag store of the cache, where the entry contains the tag bits, which are part of the address, and (2) the TLB, where each entry is a page number and its corresponding PTE. The page number is the associative part of the access.

CAM: content addressable memory. Memory that is accessed via associative access.

context switch: the act of stopping the execution of instructions from one process and starting the execution of instructions from another process. Before the processor can do that, it needs to save the registers associated with the process being stopped, and load the registers associated with the process being started. This process of saving and loading the context is referred to as a context switch.

cache consistency/coherency: in a shared memory multiprocessor system, where each processor has its own cache, it is important that a single location stored in multiple caches have the same value in all caches that store that location. If so, we say the caches are coherent with respect to that location.

multiple levels of caches: A vast disparity exists between the fast access needed by the cache that is dealing with the processor (say, one or two cycles) and the very slow access that is available from memory (say, hundreds of cycles). It is also the case that in order for the cache to be accessible in one or two cycles, it must necessarily be small. Caches can be fast or large; they cannot be both. A single small cache would therefore result in too many misses, and too many long accesses to memory. The solution is to have multiple levels of cache: L1 is small, but takes one or two cycles; L2 might take 6 or 8 cycles but can be much, much larger. Sometimes there is an L3 cache, which is slower still and larger still. Finally, there is memory -- largest, but slowest. The notion is to extend the memory hierarchy into the levels of cache.
inclusion property: In a shared memory multiprocessor, it is important to have cache coherency across all the caches. That could mean each processor's caches monitoring the traffic associated with all the other processors' caches. It would degrade performance to have the L1 cache involved in this monitoring activity. One solution is to obey the inclusion property, which states that any location present in a processor's L1 cache is also present in that processor's L2 cache. In such cases, the L2 caches can do the monitoring for cache coherency, freeing the L1 caches from that task.
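A conceptual C sketch of one consequence of the inclusion property: when the L2 evicts a line, it must also remove ("back-invalidate") that line from the L1, so the L1 contents remain a subset of the L2 contents. The flat arrays standing in for the caches and all names are made up for illustration; this is not a real coherence implementation:

    #include <stdio.h>

    #define L1_LINES 4
    #define L2_LINES 16

    static long l1[L1_LINES];   /* line addresses currently in L1 (0 = empty) */
    static long l2[L2_LINES];   /* line addresses currently in L2 (0 = empty) */

    static void l1_invalidate(long line)
    {
        for (int i = 0; i < L1_LINES; i++)
            if (l1[i] == line)
                l1[i] = 0;
    }

    /* Evict a line from L2; inclusion requires back-invalidating the L1. */
    static void l2_evict(long line)
    {
        for (int i = 0; i < L2_LINES; i++)
            if (l2[i] == line)
                l2[i] = 0;
        l1_invalidate(line);   /* keep L1 a subset of L2 */
    }

    int main(void)
    {
        l1[0] = l2[0] = 0x1A40;   /* line resident in both levels */
        l2_evict(0x1A40);         /* the eviction propagates to L1 */
        printf("L1 slot 0 after back-invalidation: %ld\n", l1[0]);  /* 0 */
        return 0;
    }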