- Caches - Review
- Operating System Memory Management without Virtual Memory
- Virtual Memory Overview
- Virtual Memory Hardware
- Operating System Memory Management with Virtual Memory
- Virtual Memory Calculations
Level 1 and level 2 caches (L1 and L2) can provide a very
fast CPU with a copy of the portion of main memory
currently being accessed.
- L1 cache is more expensive per byte than main memory and takes
up valuable space on the CPU chip.
Accessing the L1 cache is typically almost as fast as
accessing registers.
- The L2 cache is off the CPU chip. It may take a few CPU
clock cycles to access, but the CPU may be connected to
it by a special bus, and accessing it is on the order of
100 times faster than accessing main memory.
- Cache memories are organized into a fixed number of
sets.
- Each set has a fixed number of blocks of size B bytes.
Each block can hold a copy of B bytes from main
memory.
Each block also has storage for an associated
tag and a valid bit.
- A cache line consists of the valid bit, the tag and
the data block.
- The number of cache lines per cache set determines the
associativity of the cache.
A direct mapped cache has 1 line per set.
A 2-way associative cache has 2 lines per set.
A fully associative cache has only 1 set and as
many lines as the total cache memory can hold.
Lookup in a cache for contents of a given address depends
on:

    Notation   Item
    S          Number of sets
    B          Cache line block size (bytes)
    E          Number of lines per cache set
Together, these parameters determine the total data capacity of
the cache, C:
C = S * E * B (Sets * Lines/Set * Block data size of each line)
To check if the cache holds the contents of an address, the
address is partitioned into three parts.
For example, for a direct mapped cache with 512 sets, a
block size of 32 bytes, and 32 bit addresses, the cache data capacity
is 512 * 1 * 32 = 16384 = 16K bytes and
    Parameter   Size   Number of address bits
    S           512    9
    B           32     5
    Tag         -      18
The address is partitioned into:

    Tag            Set           Block Offset
    bits 31 - 14   bits 13 - 5   bits 4 - 0
Look up a 4-byte integer at address 0x08048088. Assume the
machine is little endian.
hex > 0x08048088
binary > 0000 1000 0000 0100 1000 0000 1000 1000
T:S:B > 0000 1000 0000 0100 10:00 0000 100:0 1000
T > 00 0010 0000 0001 0010 = 0x02012
S > 0 0000 0100 = 0x004 = 4
B > 0 1000 = 0x08
So look in set S = 4 and compare the tag in the (only) line
of set 4 with 0x02012.
Suppose the first 16 bytes of the data block of the line
in set 4 of the cache are

    Tag: 0x2012
    Block data: 01 02 00 00 FF FF FF FF EF BE AD DE 01 02 03 04

The tag matches, so address 0x08048088 is a cache hit.
Since B = 8, the integer data starts at byte offset 8 in the
block data.
Integers are 4 bytes and the bytes are stored
in reverse order on little endian machines.
So the integer value is 0xDEADBEEF.
-
If the cache line had been:

    Tag: 0x2080
    Block data: 01 02 00 00 FF FF FF FF EF BE AD DE 01 02 03 04

the tag would not match, and address 0x08048088 would
be a cache miss.
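This lookup can be sketched in C using the example's parameters
(512 sets, 32-byte blocks, direct mapped). The cache_line structure
and function name are illustrative, not from any real implementation:

#include <stdint.h>

#define NSETS 512                /* S: number of sets (9 index bits) */
#define BSIZE 32                 /* B: block size in bytes (5 offset bits) */

typedef struct {
    int      valid;
    uint32_t tag;                /* upper 18 bits of the address */
    uint8_t  data[BSIZE];
} cache_line;                    /* direct mapped: one line per set */

cache_line cache[NSETS];

/* Look up a 4-byte little-endian integer; returns 1 on a hit. */
int lookup_int(uint32_t addr, uint32_t *out)
{
    uint32_t offset = addr & (BSIZE - 1);         /* bits 4-0 */
    uint32_t set    = (addr >> 5) & (NSETS - 1);  /* bits 13-5 */
    uint32_t tag    = addr >> 14;                 /* bits 31-14 */

    cache_line *line = &cache[set];
    if (!line->valid || line->tag != tag)
        return 0;                                 /* cache miss */

    /* Little endian: the byte at the offset is least significant. */
    uint8_t *p = &line->data[offset];
    *out = (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    return 1;                                     /* cache hit */
}

For address 0x08048088 this computes offset = 0x08, set = 4, and
tag = 0x2012, matching the hand calculation above.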
Program segments (e.g., the code segment) must be loaded into
contiguous memory.

Consecutive memory locations are required because of the code the
compiler generates:
- sequential statements
- "if statements"
- "loops", which generate jump or branch instructions to addresses
all expect instructions to be stored in consecutive
memory locations.
In particular, when an instruction is fetched from the memory address
in the PC register (%eip on Intel IA32), the PC is then incremented. This assumes
that the next instruction is at the next memory address.
Programs must therefore be loaded into contiguous memory blocks large enough to
hold the entire program.
For an operating system embedded in a single purpose device with no external storage, this may be ok.
In a multiprogramming operating system, many processes will be loaded
into memory.
Memory allocation occurs when a process is created and
deallocation occurs when the process terminates.
However, the order of
deallocation is not predictable from the order of allocation!
So in general, there must be a free list of the memory blocks
not in use and available for allocation.
This list needs to keep
track of the beginning address of each free memory block and the size
of the block.
Memory allocation will select a free memory block (or part of one)
which is big enough for the memory request and remove
the block (or part of it) from the free list.
Process termination requires returning the memory to the free list.
Internal fragmentation is memory that is allocated to a process
but not used (for data or code).
This may occur if memory is allocated in minimum units such as
multiples of 4K bytes. A program that requires 148K + 1 bytes would be
allocated 152K bytes. Of this, 4095 bytes (4K - 1 bytes) would be
internal fragmentation.
Internal fragmentation represents some inefficiency in the use of memory.
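The rounding arithmetic can be written out directly. A small sketch
(the names are just for illustration):

#include <stddef.h>

#define ALLOC_UNIT 4096   /* allocate in multiples of 4K bytes */

/* Round a request up to the next multiple of ALLOC_UNIT. */
size_t round_up(size_t request)
{
    return (request + ALLOC_UNIT - 1) / ALLOC_UNIT * ALLOC_UNIT;
}

/* Internal fragmentation = allocated - requested.  For a request of
   148K + 1 bytes, round_up returns 152K, wasting 4K - 1 = 4095 bytes. */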
External fragmentation is free memory that is broken into
noncontiguous blocks so small that they cannot contain any program
and so cannot be allocated.
Three typical memory allocation algorithms are:
- First Fit: Memory blocks are ordered. Choose the first block
large enough.
- Best Fit: Choose the smallest free block which is large enough.
- Worst Fit: Choose the largest block.
In all cases, if the chosen free block is not the exact size, split
off the amount needed, leaving a smaller free block of the extra
amount. This eliminates all internal fragmentation.
Unfortunately, these allocation strategies cannot avoid external fragmentation.
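A sketch of first fit in C, with an illustrative free_block list kept
in increasing address order as described above. Best fit would instead
scan the entire list for the smallest block that is large enough, and
worst fit for the largest:

#include <stddef.h>

typedef struct free_block {
    size_t start;               /* beginning address of the free block */
    size_t size;                /* size of the block in bytes */
    struct free_block *next;    /* list in increasing address order */
} free_block;

/* First fit: take the first block large enough, splitting off
   exactly the requested amount (no internal fragmentation). */
size_t alloc_first_fit(free_block **list, size_t request)
{
    for (free_block **pp = list; *pp != NULL; pp = &(*pp)->next) {
        free_block *b = *pp;
        if (b->size >= request) {
            size_t addr = b->start;
            b->start += request;    /* split: remainder stays free */
            b->size  -= request;
            if (b->size == 0)
                *pp = b->next;      /* exact fit: unlink the block */
            return addr;
        }
    }
    return (size_t)-1;  /* no block is big enough: the request waits */
}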
A memory management system is working with a total memory size of
700K and uses contiguous allocation and variable sized partitions so
that no internal fragmentation occurs. It currently has three free
blocks available. Their sizes are 50K, 155K, and 100K, respectively,
and their locations in memory are as shown below. The free list is
ordered by the increasing beginning addresses of the free blocks.
0 --- increasing addresses --> 700K
+-------+-----+------+------+------+------+------+
| used | 50K | used | 155K | used | 100K | used |
+-------+-----+------+------+------+------+------+
- Show how First-Fit would allocate memory to 3 new processes of
size 90K, 100K, and 60K (in that order).
- Show how Best-Fit would allocate memory for the same three
processes.
    Request   First-Fit    Best-Fit
    (start)   50,155,100   50,155,100
    90        50,65,100    50,155,10
    100       50,65        50,55,10
    60        50,5         wait
At the third request (for 60K), both First Fit and Best Fit have a
free list with a total size of 115K, more than enough. However, for
Best Fit this 115K is broken (fragmented) into three pieces of size
50K, 55K, and 10K. None of these 3 blocks is big enough for the
request. So the request must wait until some process terminates and
releases more memory.
Memory must be allocated in contiguous memory blocks large enough to
hold the entire program (or at least each entire segment - code, data, etc.)
Internal and external memory fragmentation can result. Each of these
represents unused memory which limits the number and/or size of
simultaneous processes.
Memory allocation algorithms that split off the exact memory request
size from free blocks can eliminate internal fragmentation.
However, these algorithms (first fit, best fit, worst fit) all
suffer from external fragmentation.
Virtual memory removes the requirement that segments must be in
contiguous memory blocks.
A consequence is that external fragmentation is eliminated.
Main idea: Physical memory is thought of as being split into
fixed sized pages. All pages have the same size. Typically this
is on the order of 1K bytes.
If a request for 60K is made, then this would require 60 pages (of
size 1K each). Any 60 free pages can be allocated. The 60 free
pages do not have to form one contiguous block of 60K bytes. They can
be scattered anywhere in physical memory.
The key to making this scheme possible is:
1. Compilers still generate code as if the program (segments) are to
be loaded into contiguous blocks of storage.
2. This means the PC will still work as before - after fetching an
instruction, the PC is incremented.
3. The executable program is also thought of as being split into pages
   of the same size as used for physical memory (e.g., 1K byte
   pages). These are called virtual pages as opposed to the
   physical pages of memory.
4. A virtual page is loaded into any free physical page. (A
   data structure, the page table, must record for each virtual
   page number the physical page number where it was stored.)
5. The virtual address in the PC is not simply copied into the MAR,
   however. It is translated by the hardware memory management unit
   (MMU) in the CPU to the corresponding physical address where the
   instruction is actually located. The MMU must have the page
   table information in order to do this.
Assume the page size is 100 bytes for this example. We divide the
program into pages; memory is already divided into page frames.
Suppose that the free list of frames (i.e., physical pages) is
(0,1,2,4,6,7,8). We can use the first five to hold our program:
Virtual Program Physical Memory
page page frame contents
+--------+ +--------+
0 | page 0 | 0 | page 1 |
+--------+ +--------+
1 | page 1 | 1 | page 0 |
+--------+ +--------+
2 | page 2 | 2 | page 2 |
+--------+ +--------+
3 | page 3 | 3 | |
+--------+ +--------+
4 | page 4 | 4 | page 3 |
+--------+ +--------+
5 | |
+--------+
6 | page 4 |
+--------+
7 | |
+--------+
8 | |
+--------+
Virtual page 0 is in physical page (or page frame) 1
Virtual page 1 is in physical page (or page frame) 0
Virtual page 2 is in physical page (or page frame) 2
Virtual page 3 is in physical page (or page frame) 4
Virtual page 4 is in physical page (or page frame) 6
Page Table
Page Frame No. Protection*
+---------+---------+
0 | 1 | N |
+---------+---------+
1 | 0 | R |
+---------+---------+
2 | 2 | R |
+---------+---------+
3 | 4 | W |
+---------+---------+
4 | 6 | W |
+---------+---------+
Protection: R = Read only (e.g., code)
            W = Read or Write allowed for this page (e.g., data)
            N = Neither Read nor Write access (to catch address errors)
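A sketch of the translation the MMU performs, using this example's
100-byte pages and the page table above. Since 100 is not a power of
two, the VPN/VPO split uses division and remainder; real hardware uses
power-of-two page sizes so this reduces to bit extraction. The struct
and names are illustrative:

#include <stdio.h>

#define PAGE_SIZE 100    /* example page size (real systems use powers of 2) */

typedef struct {
    int  frame;          /* physical page frame number */
    char prot;           /* 'R', 'W', or 'N' as in the table above */
} pte;

/* Page table from the example: VPNs 0..4 map to frames 1,0,2,4,6. */
pte page_table[5] = {
    {1, 'N'}, {0, 'R'}, {2, 'R'}, {4, 'W'}, {6, 'W'}
};

/* Translate a virtual address to a physical address. */
long translate(long vaddr)
{
    long vpn = vaddr / PAGE_SIZE;    /* virtual page number */
    long vpo = vaddr % PAGE_SIZE;    /* offset within the page */

    if (vpn < 0 || vpn >= 5 || page_table[vpn].prot == 'N')
        return -1;                   /* address error / protection fault */

    return (long)page_table[vpn].frame * PAGE_SIZE + vpo;
}

int main(void)
{
    /* Virtual address 430 is on virtual page 4 (offset 30), which is
       in frame 6, so the physical address is 6*100 + 30 = 630. */
    printf("%ld\n", translate(430));
    return 0;
}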
The following notation is used in connection with translation
from virtual to physical addresses (the address is virtual and the
cache involved is the TLB):

    VPN    Virtual page number
    VPO    Virtual page offset (in bytes)
    TLBI   TLB index
    TLBT   TLB tag
After translating the virtual address to a physical address,
another cache is checked to see if it contains the memory contents of the
physical address (the address is physical and the cache involved is
the L1 cache, which is indexed by physical addresses):

    PPN    Physical page number
    PPO    Physical page offset (PPO = VPO)
    CI     Cache index
    CO     Byte offset in cache block
    CT     Cache tag
This example uses the following assumptions (see practice
problem 10.4):
- Virtual address size: 14 bits
- Physical address size: 12 bits
- Page size: 64 bytes
- 4-way-associative TLB with 4 sets (each line contains one
page table entry - PTE)
Problem: Translate virtual address 0x03d7 to a physical address
Virtual Address Format

    13 12 11 10  9  8  7  6   5  4  3  2  1  0
    |<-------- VPN -------->| |<---- VPO --->|
Physical Address Format

    11 10  9  8  7  6   5  4  3  2  1  0
    |<---- PPN ---->|   |<---- PPO --->|
First write 0x03d7 in binary:
0000 0011 1101 0111 (but this is 16 bits, so discard left 2 bits)
and partition the bits into the VPN and VPO parts:
VPN VPO
00001111 010111
Now convert VPN and VPO back to hex
VPN = 0000 1111 = 0x0F
VPO = 01 0111 = 0x17
The VPN is the index into the page table for the current
process.
However, the page table is in memory.
We do not want to have to access memory just to translate the
virtual address!
So first see if the page table entry we need is in the TLB
cache in the CPU.
VPN = 00001111 = TLBT : TLBI = 000011 : 11
There are 4 sets: 0,1,2,3. The right 2 bits of the VPN form the
TLB index, which is the same as the set number.
So the TLBI = 11 (binary) = 3 (in decimal).
The TLB tag must be compared with the 4 entries in set 3.
(Remember that the TLB is a 4-way-associative cache of page
table entries.)
TLBT = 000011 = 00 0011 = 0x03
The four tags for set 3 are
Tag PPN Valid
07 - 0
03 0D 1
0A 34 1
02 - 0
So the physical page number, PPN= 0x0D.
This information is also in the page table in memory, but if we
have a TLB hit, we avoid having to access memory for the PTE.
If instead the page table in memory must be consulted: VPN = 0x0F,
VPO = 0x17.
The page table entry at index 0x0F (see the table below) is valid,
so PPN = 0x0D.
PPO always = VPO, so the physical address is PPN:PPO.
    VPN   PPN   Valid
    00    28    1
    01    -     0
    02    33    1
    03    02    1
    04    -     0
    05    16    1
    06    -     0
    07    -     0
    08    13    1
    09    17    1
    0A    09    1
    0B    -     0
    0C    -     0
    0D    2D    1
    0E    11    1
    0F    0D    1
PPN = 0x0D, PPO = VPO = 0x17.
But we have to concatenate these bits to get the physical
address = PPN:PPO and remember that PPN is 6 bits and PPO is 6 bits
PPN = 0x0D = 0000 1101 (but discard left 2 bits) = 00 1101
PPO = 0x17 = 0001 0111 (but discard left 2 bits) = 01 0111
Physical address = PPN:PPO = 00 1101 01 0111 = 0011 0101 0111 = 0x357
The physical address 0x357 was determined by the MMU hardware
in the CPU from the virtual address, since there was a hit
in the TLB for the page table entry.
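The bit manipulations in this translation can be collected into a few
lines of C (a sketch assuming this example's 14-bit virtual addresses,
64-byte pages, and 4-set TLB; the TLB search itself is elided):

#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 6   /* 64-byte pages: VPO and PPO are 6 bits */
#define TLB_SETS  4   /* TLBI is the low 2 bits of the VPN */

int main(void)
{
    uint32_t va  = 0x03d7;
    uint32_t vpo = va & ((1u << PAGE_BITS) - 1);   /* 0x17 */
    uint32_t vpn = va >> PAGE_BITS;                /* 0x0F */

    uint32_t tlbi = vpn % TLB_SETS;                /* set 3 */
    uint32_t tlbt = vpn / TLB_SETS;                /* tag 0x03 */

    /* The TLB lookup in set 3 with tag 0x03 yielded PPN = 0x0D.
       Reassemble the physical address as PPN:PPO (PPO = VPO). */
    uint32_t ppn = 0x0D;
    uint32_t pa  = (ppn << PAGE_BITS) | vpo;       /* 0x357 */

    printf("VPN=0x%02X VPO=0x%02X TLBI=%u TLBT=0x%02X PA=0x%03X\n",
           (unsigned)vpn, (unsigned)vpo, (unsigned)tlbi,
           (unsigned)tlbt, (unsigned)pa);
    return 0;
}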
Now a lookup in the L1 cache would check to see if the contents
of the physical address are available (a hit in the L1 cache).
Summary:
- The TLB cache in the CPU is used to try to translate the
virtual address without having to access the page table
in memory.
- After the virtual address has been translated, the L1 cache
is used to try to access the data at the physical
address without having to access the actual physical
location in main memory.
If a cache miss occurs in either case, memory must be
accessed. (In this case the corresponding cache is updated.)
Problem: Look in the L1 cache for the byte contents of the
physical address just found: 0x357
Physical address: 0x357 = 0011 0101 0111 = CT:CI:CO
CT = 0011 01 = 0x0D
CI = 0101    = 0x5
CO = 11      = 0x3
L1 cache:

    Idx   Tag   Valid   Blk 0   Blk 1   Blk 2   Blk 3
    0     19    1       99      11      12      11
    1     15    0       -       -       -       -
    2     1B    1       00      02      04      08
    3     36    0       -       -       -       -
    4     32    1       43      6D      8F      09
    5     0D    1       36      72      F0      1D
    6     31    0       -       -       -       -
    7     16    1       11      C2      DF      03
    8     24    1       3A      00      51      89
    9     2D    0       -       -       -       -
    A     2D    1       93      15      DA      3B
    B     0B    0       -       -       -       -
    C     12    0       -       -       -       -
    D     16    1       04      96      34      15
    E     13    1       83      77      1B      D3
    F     14    0       -       -       -       -

Set 0x5 has tag 0x0D and is valid, so this is a cache hit.
Byte offset CO = 0x3 selects Blk 3, so the byte value is 0x1D.
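The field extraction for this lookup is again simple bit arithmetic.
A sketch (assuming, as the table implies, 16 sets and 4-byte blocks):

#include <stdio.h>

int main(void)
{
    unsigned pa = 0x357;
    unsigned co = pa & 0x3;          /* bits 1-0: byte offset in block */
    unsigned ci = (pa >> 2) & 0xF;   /* bits 5-2: cache set index */
    unsigned ct = pa >> 6;           /* bits 11-6: cache tag */

    /* For 0x357: CO = 0x3, CI = 0x5, CT = 0x0D.  Set 5 holds tag 0x0D
       and is valid, so block byte 3 (0x1D) is returned - a hit. */
    printf("CT=0x%02X CI=0x%X CO=0x%X\n", ct, ci, co);
    return 0;
}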
A page fault occurs (1) when a page is referenced which is part
of the process's code or data regions but is not
currently loaded in memory, or (2) when a page is referenced
which is not in the process's code or data regions at all.
If the VPN had been 0x07 (still set 3, but tag 0x01, which matches
no line in the set), we would have had a TLB miss.
The page table entry for VPN 07 would have to be fetched from
memory.
But that page table entry is
VPN PPN Valid
07 - 0
The page is NOT in memory. So the MMU hardware generates a page
fault.
The operating system page fault handler then must determine if
the page belongs to any of the process's code or data regions. For
now assume it does.
The page fault handler then must fetch the
page from disk! (~100,000 times slower)
Then the Page Table must be updated and the entry changed to
valid.
The TLB cache is also updated.
Then what?
A page fault can happen in the fetch step of the CPU cycle, when the
next instruction is to be loaded from a memory address and the page
containing that part of the code is not in memory.
It can also occur when an instruction's operands are being
fetched just prior to execution of the instruction.
Or it can happen when an instruction is being executed in the
CPU cycle - for example, a store instruction that moves the value in
a CPU register to a (virtual) memory location whose page
is not currently in memory.
What happens next?
The sequence of actions associated with a page fault is:
- The page fault is detected by the MMU hardware.
- The MMU hardware loads the page fault handler PC/PSW from the interrupt
vector (similar to interrupts).
- After the page fault handler has fetched the page
(possibly replacing some page) and recorded the changes in the
page table, it must adjust the interrupted user's PC. It must be
set back to the beginning of the instruction since the
instruction did not execute.
- Finally, the page fault handler can return control to the user
program, which will again attempt to execute the instruction that
caused the page fault.
We would clearly like to minimize the number of page faults and avoid
the extra time overhead necessary to handle page faults.
Before virtual memory, each segment had to be loaded into contiguous
physical memory, and memory management needed to minimize external
fragmentation.
With virtual memory, the physical memory for a segment no longer
has to be contiguous.
So the problem of fragmentation of physical memory goes
away!
New Problem: Each segment of virtual memory must be
contiguous in the virtual address space of a process.
This is not a problem for most segments, but it is a problem
for the heap.
The kernel keeps track of a break address that marks the
end of the heap segment for each process (in a variable named
brk).
There are system calls to grow the heap segment by increasing
the break address:
#include <unistd.h>
int brk(void *end_data_segment); // sets the "brk" value
void *sbrk(intptr_t increment); // adds increment to "brk" value
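For example, a program can read the current break with sbrk(0) and
grow the heap explicitly (this is what malloc does internally when it
runs out of space); a minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *old_brk = sbrk(0);       /* increment of 0: just read the break */
    if (sbrk(4096) == (void *)-1)  /* grow the heap by 4K */
        return 1;
    void *new_brk = sbrk(0);
    printf("break moved from %p to %p\n", old_brk, new_brk);
    return 0;
}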
malloc and free
The memory map function (mmap) is one way to dynamically create a
new chunk of memory (as a program is running) as a new virtual
memory segment.
More commonly, applications have used (and continue to use) the
functions malloc and free (or new and delete) to
dynamically allocate memory from the heap segment.
#include <stdlib.h>
void *calloc(size_t nmemb, size_t sz);
void *malloc(size_t sz);
void free(void *ptr);
void *realloc(void *ptr, size_t sz);
The memory allocation functions (calloc, malloc, and realloc)
all simply specify a size (in bytes) or, in the case of calloc,
an array of nmemb elements, each of size sz.
The implementation of calloc is not significantly different from
that of malloc.
The realloc function does require a few additional details
beyond malloc.
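A short usage example of this interface (checking for failure,
resizing with realloc, and releasing with free):

#include <stdlib.h>

int main(void)
{
    int *a = malloc(10 * sizeof(int));      /* room for 10 ints */
    if (a == NULL)
        return 1;
    for (int i = 0; i < 10; i++)
        a[i] = i;

    int *b = realloc(a, 20 * sizeof(int));  /* grow to 20; contents kept */
    if (b == NULL) {
        free(a);                            /* a is still valid on failure */
        return 1;
    }
    free(b);                                /* return the block to the heap */
    return 0;
}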
The main issues in heap management arise with the two
functions malloc and free. The requirements are:
- Both malloc and free should be fast.
- Together, malloc and free should make efficient use of the
heap segment.
If the heap segment becomes too fragmented, malloc can
increase its size (recall the sbrk function).
However, the external fragmentation in the heap segment
represents an inefficient use of the virtual memory (and
backing physical) resources.
These two requirements are in conflict!