- Caches - Review
- Operating System Memory Management without Virtual Memory
- Virtual Memory Overview
- Virtual Memory Hardware
- Operating System Memory Management with Virtual Memory
- Virtual Memory Calculations
Level 1 and level 2 caches (L1 and L2) can provide a very
fast CPU with a copy of the portion of main memory
currently being accessed.
- L1 cache is more expensive per byte than main memory and takes
up valuable space on the CPU chip.
Accessing the L1 cache is typically almost as fast as
accessing registers.
- The L2 cache is off the CPU chip. It may take a few CPU
clock cycles to access, but the CPU may be connected to
it by a special bus, and accessing it is on the order of
100 times faster than accessing main memory.
- Cache memories are organized into a fixed number of
sets.
- Each set has a fixed number of blocks of size B bytes.
Each block can hold a copy of B bytes from main
memory.
Each block also has storage for an associated
tag and a valid bit.
- A cache line consists of the valid bit, the tag and
the data block.
- The number of cache lines per cache set determines the
associativity of the cache.
A direct mapped cache has 1 line per set.
A 2-way associative cache has 2 lines per set.
A fully associative cache has only 1 set and as
many lines as the total cache memory can hold.
Lookup in a cache for contents of a given address depends
on:

    Notation   Item
    S          Number of sets
    B          Cache line block size (bytes)
    E          Number of lines per cache set
Together, these parameters determine the total data capacity of
the cache, C:
C = S * E * B (Sets * Lines/Set * Block data size of each line)
To check if the cache holds the contents of an address, the
address is partitioned into three parts.
For example, for a direct mapped cache with 512 sets, a
block size of 32 bytes, and 32 bit addresses, the cache data capacity
is 512 * 1 * 32 = 16384 = 16K bytes and
    Parameter   Size   Number of address bits
    S           512    9
    B           32     5
    Tag         -      18
The address is partitioned into:

    Tag            Set           Block Offset
    bits 31 - 14   bits 13 - 5   bits 4 - 0
Look up a 4-byte integer at address 0x08048088. Assume the
machine is little endian.
hex > 0x08048088
binary > 0000 1000 0000 0100 1000 0000 1000 1000
T:S:B > 0000 1000 0000 0100 10:00 0000 100:0 1000
T > 00 0010 0000 0001 0010 = 0x02012
S > 0 0000 0100 = 0x004 = 4
B > 0 1000 = 0x08
So look in set S = 4 and compare the tag in the (only) line
of set 4 with 0x02012.
Suppose the first 16 bytes of the data block of the line
in set 4 of the cache are

    Tag: 0x2012
    Block data: 01 02 00 00 FF FF FF FF EF BE AD DE 01 02 03 04

The tag matches, so address 0x08048088 is a cache hit.
Since B = 8, the integer data starts at byte offset 8 in the
block data.
Integers are 4 bytes and the bytes are stored
in reverse order on little endian machines.
So the integer value is 0xDEADBEEF.
-
If the cache line had been:

    Tag: 0x2080
    Block data: 01 02 00 00 FF FF FF FF EF BE AD DE 01 02 03 04

the tag would not match, and address 0x08048088 would
be a cache miss.
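This lookup can be sketched in C using the example's parameters
(512 sets, 32-byte blocks, direct mapped). The cache_line structure
and function name are illustrative, not from any real implementation:

#include <stdint.h>

#define NSETS 512                /* S: number of sets (9 index bits) */
#define BSIZE 32                 /* B: block size in bytes (5 offset bits) */

typedef struct {
    int      valid;
    uint32_t tag;                /* upper 18 bits of the address */
    uint8_t  data[BSIZE];
} cache_line;                    /* direct mapped: one line per set */

cache_line cache[NSETS];

/* Look up a 4-byte little-endian integer; returns 1 on a hit. */
int lookup_int(uint32_t addr, uint32_t *out)
{
    uint32_t offset = addr & (BSIZE - 1);         /* bits 4-0 */
    uint32_t set    = (addr >> 5) & (NSETS - 1);  /* bits 13-5 */
    uint32_t tag    = addr >> 14;                 /* bits 31-14 */

    cache_line *line = &cache[set];
    if (!line->valid || line->tag != tag)
        return 0;                                 /* cache miss */

    /* Little endian: the byte at the offset is least significant. */
    uint8_t *p = &line->data[offset];
    *out = (uint32_t)p[0] | (uint32_t)p[1] << 8
         | (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    return 1;                                     /* cache hit */
}

For address 0x08048088 this computes offset = 0x08, set = 4, and
tag = 0x2012, matching the hand calculation above.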
Program segments (e.g., the code segment) must be loaded into
contiguous memory.

Consecutive memory locations are required because of the code the
compiler generates:
- sequential statements
- "if statements"
- "loops", which generate jump or branch instructions to addresses
all expect instructions to be stored in consecutive
memory locations.
In particular, when an instruction is fetched from the memory address
in the PC register (%eip on Intel IA32), the PC is then incremented. This assumes
that the next instruction is at the next memory address.
Programs must therefore be loaded into contiguous memory blocks large enough to
hold the entire program.
For an operating system embedded in a single purpose device with no external storage, this may be ok.
In a multiprogramming operating system, many processes will be loaded
into memory.
Memory allocation occurs when a process is created and
deallocation occurs when the process terminates.
However, the order of
deallocation is not predictable from the order of allocation!
So in general, there must be a free list of the memory blocks
not in use and available for allocation.
This list needs to keep
track of the beginning address of each free memory block and the size
of the block.
Memory allocation will select a free memory block (or part of one)
which is big enough for the memory request and remove
the block (or part of it) from the free list.
Process termination requires returning the memory to the free list.
Internal fragmentation is memory that is allocated to a process
but not used (for data or code).
This may occur if memory is allocated in minimum units such as
multiples of 4K bytes. A program that requires 148K + 1 bytes would be
allocated 152K bytes. Of this, 4095 bytes (4K - 1 bytes) would be
internal fragmentation.
Internal fragmentation represents some inefficiency in the use of memory.
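The rounding arithmetic can be written out directly. A small sketch
(the names are just for illustration):

#include <stddef.h>

#define ALLOC_UNIT 4096   /* allocate in multiples of 4K bytes */

/* Round a request up to the next multiple of ALLOC_UNIT. */
size_t round_up(size_t request)
{
    return (request + ALLOC_UNIT - 1) / ALLOC_UNIT * ALLOC_UNIT;
}

/* Internal fragmentation = allocated - requested.  For a request of
   148K + 1 bytes, round_up returns 152K, wasting 4K - 1 = 4095 bytes. */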
External fragmentation is free memory that is broken into
noncontiguous blocks so small that they cannot contain any program
and so cannot be allocated.
Three typical memory allocation algorithms are:
- First Fit: Memory blocks are ordered. Choose the first block
large enough.
- Best Fit: Choose the smallest free block which is large enough.
- Worst Fit: Choose the largest block.
In all cases, if the chosen free block is not the exact size, split
off the amount needed, leaving a smaller free block of the extra
amount. This eliminates all internal fragmentation.
Unfortunately, these allocation strategies cannot avoid external fragmentation.
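A sketch of first fit in C, with an illustrative free_block list kept
in increasing address order as described above. Best fit would instead
scan the entire list for the smallest block that is large enough, and
worst fit for the largest:

#include <stddef.h>

typedef struct free_block {
    size_t start;               /* beginning address of the free block */
    size_t size;                /* size of the block in bytes */
    struct free_block *next;    /* list in increasing address order */
} free_block;

/* First fit: take the first block large enough, splitting off
   exactly the requested amount (no internal fragmentation). */
size_t alloc_first_fit(free_block **list, size_t request)
{
    for (free_block **pp = list; *pp != NULL; pp = &(*pp)->next) {
        free_block *b = *pp;
        if (b->size >= request) {
            size_t addr = b->start;
            b->start += request;    /* split: remainder stays free */
            b->size  -= request;
            if (b->size == 0)
                *pp = b->next;      /* exact fit: unlink the block */
            return addr;
        }
    }
    return (size_t)-1;  /* no block is big enough: the request waits */
}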
A memory management system is working with a total memory size of
700K and uses contiguous allocation and variable sized partitions so
that no internal fragmentation occurs. It currently has three free
blocks available. Their sizes are 50K, 155K, and 100K, respectively,
and their locations in memory are as shown below. The free list is
ordered by the increasing beginning addresses of the free blocks.
0 --- increasing addresses --> 700K
+-------+-----+------+------+------+------+------+
| used | 50K | used | 155K | used | 100K | used |
+-------+-----+------+------+------+------+------+
- Show how First-Fit would allocate memory to 3 new processes of
size 90K, 100K, and 60K (in that order).
- Show how Best-Fit would allocate memory for the same three
processes.
    Request   First-Fit    Best-Fit
    (start)   50,155,100   50,155,100
    90        50,65,100    50,155,10
    100       50,65        50,55,10
    60        50,5         wait
At the third request (for 60K), both First Fit and Best Fit have a
free list with a total size of 115K, more than enough. However, for
Best Fit this 115K is broken (fragmented) into three pieces of size
50K, 55K, and 10K. None of these 3 blocks is big enough for the
request. So the request must wait until some process terminates and
releases more memory.
Memory must be allocated in contiguous memory blocks large enough to
hold the entire program (or at least each entire segment - code, data, etc.)
Internal and external memory fragmentation can result. Each of these
represents unused memory which limits the number and/or size of
simultaneous processes.
Memory allocation algorithms that split off the exact memory request
size from free blocks can eliminate internal fragmentation.
However, these algorithms (first fit, best fit, worst fit) all
suffer from external fragmentation.
Virtual memory removes the requirement that segments must be in
contiguous memory blocks.
A consequence is that external fragmentation is eliminated.
Main idea: Physical memory is thought of as being split into
fixed sized pages. All pages have the same size. Typically this
is on the order of 1K bytes.
If a request for 60K is made, then this would require 60 pages (of
size 1K each). Any 60 free pages can be allocated. The 60 free
pages do not have to form one contiguous block of 60K bytes. They can
be scattered anywhere in physical memory.
The key to making this scheme possible is:
1. Compilers still generate code as if the program (segments) are to
be loaded into contiguous blocks of storage.
2. This means the PC will still work as before - after fetching an
instruction, the PC is incremented.
3. The executable program is also thought of as being split into pages
   of the same size as used for physical memory (e.g., 1K byte
   pages). These are called virtual pages as opposed to the
   physical pages of memory.
4. A virtual page is loaded into any free physical page. (A
   data structure, the page table, must record for each virtual
   page number the physical page number where it was stored.)
5. The virtual address in the PC is not simply copied into the MAR,
   however. It is translated by the hardware memory management unit
   (MMU) in the CPU to the corresponding physical address where the
   instruction is actually located. The MMU must have the page
   table information in order to do this.
Assume the page size is 100 bytes for this example. We divide the
program into pages; memory is already divided into page frames.
Suppose that the free list of frames (i.e., physical pages) is
(0,1,2,4,6,7,8). We can use the first five to hold our program:
Virtual Program Physical Memory
page page frame contents
+--------+ +--------+
0 | page 0 | 0 | page 1 |
+--------+ +--------+
1 | page 1 | 1 | page 0 |
+--------+ +--------+
2 | page 2 | 2 | page 2 |
+--------+ +--------+
3 | page 3 | 3 | |
+--------+ +--------+
4 | page 4 | 4 | page 3 |
+--------+ +--------+
5 | |
+--------+
6 | page 4 |
+--------+
7 | |
+--------+
8 | |
+--------+
Virtual page 0 is in physical page (or page frame) 1
Virtual page 1 is in physical page (or page frame) 0
Virtual page 2 is in physical page (or page frame) 2
Virtual page 3 is in physical page (or page frame) 4
Virtual page 4 is in physical page (or page frame) 6
Page Table
Page Frame No. Protection*
+---------+---------+
0 | 1 | N |
+---------+---------+
1 | 0 | R |
+---------+---------+
2 | 2 | R |
+---------+---------+
3 | 4 | W |
+---------+---------+
4 | 6 | W |
+---------+---------+
Protection: R = Read only (e.g., code)
            W = Read or Write allowed for this page (e.g., data)
            N = Neither Read nor Write access (to catch address errors)
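A sketch of the translation the MMU performs, using this example's
100-byte pages and the page table above. Since 100 is not a power of
two, the VPN/VPO split uses division and remainder; real hardware uses
power-of-two page sizes so this reduces to bit extraction. The struct
and names are illustrative:

#include <stdio.h>

#define PAGE_SIZE 100    /* example page size (real systems use powers of 2) */

typedef struct {
    int  frame;          /* physical page frame number */
    char prot;           /* 'R', 'W', or 'N' as in the table above */
} pte;

/* Page table from the example: VPNs 0..4 map to frames 1,0,2,4,6. */
pte page_table[5] = {
    {1, 'N'}, {0, 'R'}, {2, 'R'}, {4, 'W'}, {6, 'W'}
};

/* Translate a virtual address to a physical address. */
long translate(long vaddr)
{
    long vpn = vaddr / PAGE_SIZE;    /* virtual page number */
    long vpo = vaddr % PAGE_SIZE;    /* offset within the page */

    if (vpn < 0 || vpn >= 5 || page_table[vpn].prot == 'N')
        return -1;                   /* address error / protection fault */

    return (long)page_table[vpn].frame * PAGE_SIZE + vpo;
}

int main(void)
{
    /* Virtual address 430 is on virtual page 4 (offset 30), which is
       in frame 6, so the physical address is 6*100 + 30 = 630. */
    printf("%ld\n", translate(430));
    return 0;
}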
The following notation is used in connection with translation
from virtual to physical addresses (the address is virtual and the
cache involved is the TLB):

    VPN    Virtual page number
    VPO    Virtual page offset (in bytes)
    TLBI   TLB index
    TLBT   TLB tag
After translating the virtual address to a physical address,
another cache is checked to see if it contains the memory contents of the
physical address (the address is physical and the cache involved is
the L1 cache, which is indexed by physical addresses):

    PPN    Physical page number
    PPO    Physical page offset (PPO = VPO)
    CI     Cache index
    CO     Byte offset in cache block
    CT     Cache tag
This example uses the following assumptions (see practice
problem 10.4):
- Virtual address size: 14 bits
- Physical address size: 12 bits
- Page size: 64 bytes
- 4-way-associative TLB with 4 sets (each line contains one
page table entry - PTE)
Problem: Translate virtual address 0x03d7 to a physical address
Virtual Address Format

    13 12 11 10  9  8  7  6   5  4  3  2  1  0
    |<-------- VPN -------->| |<---- VPO --->|
Physical Address Format

    11 10  9  8  7  6   5  4  3  2  1  0
    |<---- PPN ---->|   |<---- PPO --->|
First write 0x03d7 in binary:
0000 0011 1101 0111 (but this is 16 bits, so discard left 2 bits)
and partition the bits into the VPN and VPO parts:
VPN VPO
00001111 010111
Now convert VPN and VPO back to hex
VPN = 0000 1111 = 0x0F
VPO = 01 0111 = 0x17
The VPN is the index into the page table for the current
process.
However, the page table is in memory.
We do not want to have to access memory just to translate the
virtual address!
So first see if the page table entry we need is in the TLB
cache in the CPU.
VPN = 00001111 = TLBT : TLBI = 000011 : 11
There are 4 sets: 0,1,2,3. The right 2 bits of the VPN form the
TLB index, which is the same as the set number.
So the TLBI = 11 (binary) = 3 (in decimal).
The TLB tag must be compared with the 4 entries in set 3.
(Remember that the TLB is a 4-way-associative cache of page
table entries.)
TLBT = 000011 = 00 0011 = 0x03
The four tags for set 3 are
Tag PPN Valid
07 - 0
03 0D 1
0A 34 1
02 - 0
So the physical page number, PPN= 0x0D.
This information is also in the page table in memory, but if we
have a TLB hit, we avoid having to access memory for the PTE.
If instead the page table in memory must be consulted: VPN = 0x0F,
VPO = 0x17.
The page table entry at index 0x0F (see the table below) is valid,
so PPN = 0x0D.
PPO always = VPO, so the physical address is PPN:PPO.
    VPN   PPN   Valid
    00    28    1
    01    -     0
    02    33    1
    03    02    1
    04    -     0
    05    16    1
    06    -     0
    07    -     0
    08    13    1
    09    17    1
    0A    09    1
    0B    -     0
    0C    -     0
    0D    2D    1
    0E    11    1
    0F    0D    1
PPN = 0x0D, PPO = VPO = 0x17.
But we have to concatenate these bits to get the physical
address = PPN:PPO and remember that PPN is 6 bits and PPO is 6 bits
PPN = 0x0D = 0000 1101 (but discard left 2 bits) = 00 1101
PPO = 0x17 = 0001 0111 (but discard left 2 bits) = 01 0111
Physical address = PPN:PPO = 00 1101 01 0111 = 0011 0101 0111 = 0x357
The physical address 0x357 was determined by the MMU hardware
in the CPU from the virtual address, since there was a hit
in the TLB for the page table entry.
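The bit manipulations in this translation can be collected into a few
lines of C (a sketch assuming this example's 14-bit virtual addresses,
64-byte pages, and 4-set TLB; the TLB search itself is elided):

#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 6   /* 64-byte pages: VPO and PPO are 6 bits */
#define TLB_SETS  4   /* TLBI is the low 2 bits of the VPN */

int main(void)
{
    uint32_t va  = 0x03d7;
    uint32_t vpo = va & ((1u << PAGE_BITS) - 1);   /* 0x17 */
    uint32_t vpn = va >> PAGE_BITS;                /* 0x0F */

    uint32_t tlbi = vpn % TLB_SETS;                /* set 3 */
    uint32_t tlbt = vpn / TLB_SETS;                /* tag 0x03 */

    /* The TLB lookup in set 3 with tag 0x03 yielded PPN = 0x0D.
       Reassemble the physical address as PPN:PPO (PPO = VPO). */
    uint32_t ppn = 0x0D;
    uint32_t pa  = (ppn << PAGE_BITS) | vpo;       /* 0x357 */

    printf("VPN=0x%02X VPO=0x%02X TLBI=%u TLBT=0x%02X PA=0x%03X\n",
           (unsigned)vpn, (unsigned)vpo, (unsigned)tlbi,
           (unsigned)tlbt, (unsigned)pa);
    return 0;
}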
Now a lookup in the L1 cache would check to see if the contents
of the physical address are available (a hit in the L1 cache).
Summary:
- The TLB cache in the CPU is used to try to translate the
virtual address without having to access the page table
in memory.
- After the virtual address has been translated, the L1 cache
is used to try to access the data at the physical
address without having to access the actual physical
location in main memory.
If a cache miss occurs in either case, memory must be
accessed. (In this case the corresponding cache is updated.)
Problem: Look in the L1 cache for the byte contents of the
physical address just found: 0x357
Physical address: 0x357 = 0011 0101 0111 = CT:CI:CO
CT = 0011 01 = 0x0D
CI = 0101    = 0x5
CO = 11      = 0x3
L1 cache:

    Idx   Tag   Valid   Blk 0   Blk 1   Blk 2   Blk 3
    0     19    1       99      11      12      11
    1     15    0       -       -       -       -
    2     1B    1       00      02      04      08
    3     36    0       -       -       -       -
    4     32    1       43      6D      8F      09
    5     0D    1       36      72      F0      1D
    6     31    0       -       -       -       -
    7     16    1       11      C2      DF      03
    8     24    1       3A      00      51      89
    9     2D    0       -       -       -       -
    A     2D    1       93      15      DA      3B
    B     0B    0       -       -       -       -
    C     12    0       -       -       -       -
    D     16    1       04      96      34      15
    E     13    1       83      77      1B      D3
    F     14    0       -       -       -       -

Set 0x5 has tag 0x0D and is valid, so this is a cache hit.
Byte offset CO = 0x3 selects Blk 3, so the byte value is 0x1D.
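The field extraction for this lookup is again simple bit arithmetic.
A sketch (assuming, as the table implies, 16 sets and 4-byte blocks):

#include <stdio.h>

int main(void)
{
    unsigned pa = 0x357;
    unsigned co = pa & 0x3;          /* bits 1-0: byte offset in block */
    unsigned ci = (pa >> 2) & 0xF;   /* bits 5-2: cache set index */
    unsigned ct = pa >> 6;           /* bits 11-6: cache tag */

    /* For 0x357: CO = 0x3, CI = 0x5, CT = 0x0D.  Set 5 holds tag 0x0D
       and is valid, so block byte 3 (0x1D) is returned - a hit. */
    printf("CT=0x%02X CI=0x%X CO=0x%X\n", ct, ci, co);
    return 0;
}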
A page fault occurs (1) when a page is referenced which is part
of the process's code or data regions but is not
currently loaded in memory, or (2) when a page is referenced
which is not in the process's code or data regions at all.
If the VPN had been 0x07 (still set 3, but tag 0x01, which matches
no line in the set), we would have had a TLB miss.
The page table entry for VPN 07 would have to be fetched from
memory.
But that page table entry is
VPN PPN Valid
07 - 0
The page is NOT in memory. So the MMU hardware generates a page
fault.
The operating system page fault handler then must determine if
the page belongs to any of the process's code or data regions. For
now assume it does.
The page fault handler then must fetch the
page from disk! (~100,000 times slower)
Then the Page Table must be updated and the entry changed to
valid.
The TLB cache is also updated.
Then what?
A page fault can happen in the fetch step of the CPU cycle, when the
next instruction is to be loaded from a memory address and the page
containing that part of the code is not in memory.
It can also occur when an instruction's operands are being
fetched just prior to execution of the instruction.
Or it can happen when an instruction is being executed in the
CPU cycle - for example, a store instruction that moves the value in
a CPU register to a (virtual) memory location whose page
is not currently in memory.
What happens next?
The sequence of actions associated with a page fault is:
- The page fault is detected by the MMU hardware.
- The MMU hardware loads the page fault handler PC/PSW from the interrupt
vector (similar to interrupts).
- After the page fault handler has fetched the page
(possibly replacing some page) and recorded the changes in the
page table, it must adjust the interrupted user's PC. It must be
set back to the beginning of the instruction since the
instruction did not execute.
- Finally, the page fault handler can return control to the user
program, which will again attempt to execute the instruction that
caused the page fault.
We would clearly like to minimize the number of page faults and avoid
the extra time overhead necessary to handle page faults.
Before virtual memory, each segment had to be loaded into contiguous
physical memory, and memory management needed to minimize external
fragmentation.
With virtual memory, the physical memory for a segment no longer
has to be contiguous.
So the problem of fragmentation of physical memory goes
away!
New Problem: Each segment of virtual memory must be
contiguous in the virtual address space of a process.
This is not a problem for most segments, but it is a problem
for the heap.
The kernel keeps track of a break address that marks the
end of the heap segment for each process (in a variable named
brk).
There are system calls to grow the heap segment by increasing
the break address:
#include <unistd.h>
int brk(void *end_data_segment); // sets the "brk" value
void *sbrk(intptr_t increment); // adds increment to "brk" value
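For example, a program can read the current break with sbrk(0) and
grow the heap explicitly (this is what malloc does internally when it
runs out of space); a minimal sketch:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *old_brk = sbrk(0);       /* increment of 0: just read the break */
    if (sbrk(4096) == (void *)-1)  /* grow the heap by 4K */
        return 1;
    void *new_brk = sbrk(0);
    printf("break moved from %p to %p\n", old_brk, new_brk);
    return 0;
}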
malloc and free
The memory map function (mmap) is one way to dynamically create a
new chunk of memory (as a program is running) as a new virtual
memory segment.
More commonly, applications have used (and continue to use) the
functions malloc and free (or new and delete) to
dynamically allocate memory from the heap segment.
#include <stdlib.h>
void *calloc(size_t nmemb, size_t sz);
void *malloc(size_t sz);
void free(void *ptr);
void *realloc(void *ptr, size_t sz);
The memory allocation functions (calloc, malloc, and realloc)
all simply specify a size (in bytes) or, in the case of calloc,
an array of nmemb elements, each of size sz.
The implementation of calloc is not significantly different from
that of malloc.
The realloc function does require a few additional details
beyond malloc.
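A short usage example of this interface (checking for failure,
resizing with realloc, and releasing with free):

#include <stdlib.h>

int main(void)
{
    int *a = malloc(10 * sizeof(int));      /* room for 10 ints */
    if (a == NULL)
        return 1;
    for (int i = 0; i < 10; i++)
        a[i] = i;

    int *b = realloc(a, 20 * sizeof(int));  /* grow to 20; contents kept */
    if (b == NULL) {
        free(a);                            /* a is still valid on failure */
        return 1;
    }
    free(b);                                /* return the block to the heap */
    return 0;
}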
The main issues in heap management arise with the two
functions malloc and free. The requirements are:
- Both malloc and free should be fast.
- Together, malloc and free should make efficient use of the
heap segment.
If the heap segment becomes too fragmented, malloc can
increase its size (recall the sbrk function).
However, the external fragmentation in the heap segment
represents an inefficient use of the virtual memory (and
backing physical) resources.
These two requirements are in conflict!