CSC373: Computer Systems 1 Summer 1, 2009 Homework 3 *** Answers are put below the questions. ;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;; Please answer each of the following and show your reasoning. 1. Cache C has 16 sets and is direct mapped, that is, it has one block per set. Assume that the block holds only 1 word. In short, we have 16 sets, each with one 8-bit block (or, if you like, one word per block). Process P generates this reference string, each a word address: 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17, 9, 56, 9 C is initially empty. For each memory reference above, label it as a Hit or a Miss. Show C's contents after the last memory reference. Assume that a word's cache address is its memory address modulo the cache's size in sets (for example, the word at address 1 has cache address 1 % 16 = 1; the word at address 19 has cache address 19 % 16 = 3; etc.) So the word at address 1 goes into set 1, and so on. **** Solution: I've broken the reference string into two lines for readability. Above each memory address is H for HIT or M for MISS and below each is the set to which it maps. A * (for example, S1*) indicates that there's an explanation given below. Sets: S1, S2,...,Sa, Sb, Sc, Sd, Se, Sf, Sg Reference string (in two lines for readability): M M M M M M M M M M 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, S1 S4 S8 S5 S4 S1 S3 S8 S9 Sb M M M M H H H H H 4, 43, 5, 6, 9, 17, 9, 56, 9 S4* S5 S5* S6 S9 S1 S9 S8 S9 Explanations: S4* miss in second row: Address 20 in the first row also maps to S4, displacing 4; hence, the second reference to 4 is also a MISS. S5* miss in second row: Adddress 43 in the second row maps to S5, displacing 5; hence, the second reference to 5 is also a MISS. 2. Use the same reference string as in question (1). However, this time assume that there are 4 sets and that C is 4-way set associative: there are 4 sets, each with 4 blocks. A block still holds 1 word. The set address is now the reference string number modulus 4. For instance, the first number in the reference string is 1; hence, the word at address 1 goes into set 1 % 4 == set 1. Do the same as in (1), that is, label each word reference as a cache Hit or Miss and show the cache's contents at the end. To begin, we assume the cache is empty. For the blocks per set, you can label them A, B, C, and D. Start with A, if it's available; then go to B; and so on. If a block is full, pick the next empty one. If all are full, start with the next in line (A, B, C, and D is the order). **** Solution: Now there are four blocks in each set, depicted here as array elements indexed from A throught D/ Si +----+----+----+----+ | | | | | +----+----+----+----+ A B C D I'm assuming that the words initially fill in left to right (index A, index B, index C, index D). I maintain a pointer to the last access to determine which to kick out if the set is full. For example, suppose that Si is full and last access was to position [C]. Then the next entry would go into [D], the next into [A], and so on. There are only four sets: S0, S1, S2, S3. (I'm numbering them this way for simplicity but any consistent scheme is OK.) Below each I now put the set and the position. For instance, S1[C] means "S1, index position == block C." M M M M M M M M M M 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, S1[A] S0[A] S0[B] S1[B] S0[C] S1[C] S3[A] S0[D] S1[D] S3[B] H M H M H H H H H 4, 43, 5, 6, 9, 17, 9, 56, 9 S0[A] S3[C] S1[B] S2[A] S1[D] S1[C] S1[D] S0[D] S1[D] Here's a C program that simulates the problem, together with the output. In this program, the indexes are 0 through n - 1, where n is the number of blocks per set. #include #define SetCount (4) #define WordCount (4) #define Empty (-1) #define True (1) #define False (0) int sets[SetCount][WordCount]; int lru[SetCount]; int already_in_p(int word, int set_index) { int w = 0; while (w < WordCount) if (word == sets[set_index][w]) return True; else w++; return False; } int main() { int reference_string[ ] = { 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17, 9, 56, 9, Empty }; char* set_names[ ] = {"S0", "S1", "S2", "S3"}; int i, j; /* Initialize the cache to empty. */ for (i = 0; i < SetCount; i++) for (j = 0; j < WordCount; j++) sets[i][j] = -1; /* Initialize the LRU pointers to -1 */ for (i = 0; i < SetCount; i++) lru[i] = -1; /* Iterate through the reference string. */ i = 0; while (reference_string[i] != Empty) { int next_word = reference_string[i]; int set_index = next_word % SetCount; if (already_in_p(next_word, set_index)) printf("Hit for word %i in Set %i\n", next_word, set_index); else { printf("Miss for word %i in Set %i\n", next_word, set_index); lru[set_index] = (lru[set_index] + 1) % SetCount; sets[set_index][lru[set_index]] = next_word; printf("Word %i inserted into Set %i at position %i\n", next_word, set_index, lru[set_index]); } i++; } return 0; } /* output: Miss for word 1 in Set 1 Word 1 inserted into Set 1 at position 0 Miss for word 4 in Set 0 Word 4 inserted into Set 0 at position 0 Miss for word 8 in Set 0 Word 8 inserted into Set 0 at position 1 Miss for word 5 in Set 1 Word 5 inserted into Set 1 at position 1 Miss for word 20 in Set 0 Word 20 inserted into Set 0 at position 2 Miss for word 17 in Set 1 Word 17 inserted into Set 1 at position 2 Miss for word 19 in Set 3 Word 19 inserted into Set 3 at position 0 Miss for word 56 in Set 0 Word 56 inserted into Set 0 at position 3 Miss for word 9 in Set 1 Word 9 inserted into Set 1 at position 3 Miss for word 11 in Set 3 Word 11 inserted into Set 3 at position 1 Hit for word 4 in Set 0 Miss for word 43 in Set 3 Word 43 inserted into Set 3 at position 2 Hit for word 5 in Set 1 Miss for word 6 in Set 2 Word 6 inserted into Set 2 at position 0 Hit for word 9 in Set 1 Hit for word 17 in Set 1 Hit for word 9 in Set 1 Hit for word 56 in Set 0 Hit for word 9 in Set 1 */ 3. Use the same reference string as in question (1). Assume that cache C is 2-way set associative with 8 sets. Each block holds 1 word. Replacement is LRU. A memory reference's cache set is the memory address modulus 8 (for example, the word at address 8 goes into set 0; the word at address 11 goes into set 3; etc.) Do the same as in (1). **** Solution: Time ticks are shown in columns for ease of tracing. Accesses are marked with M or H for "miss" and "hit," respectively (e.g., 1M means 1 was accessed but was not already in the cache but was put there as a result of the access.) 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17, 9, 56, 9 t0 t1 t2 t3 t4 t5 t6 t7 t8 t9 ta tb tc td te tf tg th ti S0 L1: 8M L2: 56M 56H S1 L1: 1M 9M 9H 9H 9H L2: 17M 17H S2 L1: L2: S3 L1: 19M 43M L2: 11M S4 L1: 4M 4H L2: 20M S5 L1: 5M 5H L2: S6 L1: 6M L2: S7 L1: L2: 4. Computer system S uses 32-bit virtual addresses as cache addresses. The cache address has three fields, left to right: tag bits set identifier word offset So how many bits are used in each field given 1,024 sets each with 8 lines. Each line contains 32 8-bit words. (A "line" is the same as a "block.") **** Solution: The "log" used here is to base 2. log(1024) == 10: so 10 bits for the set index log(32) == 5: so 5 bits for the word offset 32 - (10 + 5) == 17 bits for the tag The layout is: 17 bits 10 bits 5 bits +-----------------+----------+-----+ | tag | set index| ind | +-----------------+----------+-----+ 5. A cache's total bit size is partitioned into "data bits" (that is, data or instructions) and "overhead bits" (bits for directory tags, the valid bit, the LRU bits, if used, and so on). For now, the only "overhead bits" of interest are the directory or tag bits. Consider two set-associative caches with the same capacity: C1: 1024 sets, 8 blocks per set, 32 words per block, 8 bits per word C2: 2048 sets, 4 blocks per set, 32 words per block, 8 bits per word Contrast the difference between "data bit" capacity and "overhead bit" size for the two caches. Assume that the word offset is log(M) low-order (rightmost) bits, where M is the number of words per block; and that set address is the middle log(N) low-order bits, where N is the number of sets. The remaining (leftmost) bits are directory tags. Each set uses 32-bit addresses. **** Solution: "log" again means "log to base 2." cache capacity in bits = sets * lines per set * words per line * bits per word directory bits = sets * lines per set * tag bits per line C1: cache capacity in bits == 1024 * 8 * 32 * 8 == 2,097,152 So capacity in bits is about 2M. set index == log(1024) == 10 bits word offset == log(32) == 5 bits tag == 32 - (10 + 5) == 17 bits There are 1,024 sets, each with 8 lines or blocks, each of which has a 17-bit tag; so the total for the tags (the "directory") is 1024 * 8 * 17 == 139,264 or roughly 140K bits. So the ratio is 139264 / 2097152 or roughly 6.6%. C2: cache capcicity in bits == 2048 * 4 * 32 * 8 = 2,097,152 set index == log(2048) == 11 bits word offset == log(32) == 5 bits tag == 32 - (11 + 5) == 16 bits There are 2,048 sets, each with 4 lines, each of which has a 16-bit tag; so the total for the directory is 2048 * 4 * 16 == 131,072 or roughly 130K bits. So the ratio is 131072 / 2097152 or roughly 6.25%. Circuitry considerations aside, C1 has only a slighly larger directory but has twice as many lines into which a block of memory bytes can go. In effect, the cost of the extra directory bits is insignificant given the potential reducation in conflict misses (that is, misses that occur when multiple memory addresses map to the same cache set). Of course, C2 has twice as many sets --- and defenders of C2's design would argue that the doubling of sets likewise reduces the potential for conflict misses. Conclusion: it's a hard call. The trend, though, is to increase both the number of sets and lines per set. 6. Consider the 32-bit virtual address 11110000 11110000 11110000 11110000 for an L2 cache with 4 blocks per set, 2048 sets, and 128 words per block. The address's low-order bits are on the right. Assume standard formatting for a cache address. (I've broken the address into four chunks of 8 bits apiece for readability only.) 1. Mark the bits that give the word offset in the line. 2. Mark the bits that specify the set. 3. Mark the bits that constitute the tag or key. 4. How many tags are in the directory as a whole? 5. How many directory tags must be compared against this address’s tag bits when a cache lookup occurs? **** Solution: Bits that mark the word offset have an 'o' above them. Bits that mark the set index have an 's' above them. Bits that mark the tag have a 't' above them. log(1024) == 10 log(128) == 7 Tag bits per line == 32 - (10 + 7) == 15 tttttttt ttttttts ssssssss sooooooo 11110000 11110000 11110000 11110000 7. Some cache designers are abandoning LRU as replacement policy and going instead to random replacement. To see the problem, consider a 6-way set associative cache with each block holding 1 word. Now consider this reference string for a code segment in a loop: 1, 4, 7, 8, 9, 11, 13, 1, 4, 7, 8, 9, 11, 13, 1, 4, 7, 8, 9, 11, 13,... What problems does LRU cause given this reference string and our cache capacity? (Assume that words enter the cache in the order in which they arrive, for example, the word at address 1 goes into the first block in a given set, the word at address 4 goes into the second slot, etc.) Make the case that random replacement would yield better performance than does LRU for this reference string. **** From left to right I show history in a particular set line. I'm assuming all words map to the same set but failed to say so in the question; hence, you'll be graded on whatever assumption you made. By assumption, all addresses map to the same set. After the word is either M for miss or H for hit (e.g., 1M means that the word at address 1 went in but was not already in). 1, 4, 7, 8, 9, 11, 13, 1, 4, 7, 8, 9, 11, 13, 1, 4, 7, 8, 9, 11, 13,... Clock ticks: t1 t2 t3 t4 t5 t6 t7 t8 t9 ta tb tc td te tf tg th ti tj tk tl Set L1: 1M 13M 11M 9M L2: 4M 1M 13M 11M L3: 7M 4M 1M 13M L4: 8M 7M 4M L5: 9M 8M 7M L6: 11M 9M 8M Conclusion: The "working set" of addresses is 1, 4, 7, 8, 9, 11, 13 and this repeats indefinitely. Under LRU, the word that gets kicked out is the next word to be accessed. For example, the last word in the working set is 13 and the first is 1; so, by definition, 1 is the least recently used as it occurs only once before 13. So 13 kicks out 1 in the cache. Immediately after 13 comes 1 again, which kicks out 4, which comes next; and so on. Think of the working set as the body of a loop that iterates thousands or even millions of times. What a disaster! If we just randomly selected one of the word to kick out, we could not do any worse and presumably, given the laws of probability, would do better.