Memory Management
What if there's no memory abstraction?
Processes access physical memory directly, so they must be relocated so that they do not overlap in physical memory.
Relocation can be done at program load time, but it is a bad idea:
very slow
requires extra relocation info from the program
Memory Abstraction
To allow several programs to co-exist in memory we need:
Protection
Relocation
Sharing
Logical organization
Physical organization
For that, we have a new abstraction for memory: Address Space
Address Space
Address Space: set of addresses that a process can use to address memory
Defines where sections of data and code are located in the 32- or 64-bit address space.
Defines protection of such sections: ReadOnly, ReadWrite, Execute.
Confined "private" addressing concept: requires form of address virtualization.
Base and Limit
Map each process address space onto a different part of physical memory.
Two registers: Base and Limit.
Base: start address of a program in physical memory.
Limit: length of the program
Only OS can modify Base and Limit.
Add and Compare
For every memory access:
Base is added to the address.
Result is compared to Limit.
This can be done in hardware. So it doesn't significantly add to latency.
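A minimal sketch of what the MMU does on every access under base-and-limit. The register names are hypothetical; this variant checks the address against the limit (program length) before adding the base, which is equivalent to the add-then-compare order described above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical MMU registers, loaded by the OS on a context switch. */
static uint32_t base_reg;   /* start address of the process in physical memory */
static uint32_t limit_reg;  /* length of the process image                     */

/* Done in hardware for every load/store and instruction fetch. */
static bool translate(uint32_t vaddr, uint32_t *paddr)
{
    if (vaddr >= limit_reg)      /* compare against Limit            */
        return false;            /* -> protection fault, trap to OS  */
    *paddr = base_reg + vaddr;   /* add Base                         */
    return true;
}
```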
Swapping
Programs move in and out of memory
Holes are created
Holes can be combined → memory compaction
What if a process needs more memory?
If a hole is adjacent to the process, it can be allocated to it
Otherwise the process has to be moved to a bigger hole
Or the process is suspended until enough memory becomes available
Managing Free Memory
Bitmap and Linked List are universal methods used in OS and applications. Other methods employ Heaps.
A bitmap is slow when searching for k consecutive 0 bits to place a new process.
The linked-list method keeps a list of allocated and free memory segments. It is more convenient to use doubly linked lists.
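A minimal sketch of such a doubly linked segment list with first-fit allocation (the struct and function names are made up for illustration):

```c
#include <stddef.h>
#include <stdbool.h>

/* One memory segment: either a hole (free) or an allocated region. */
struct segment {
    size_t start, length;
    bool free;
    struct segment *prev, *next;   /* doubly linked for easy merging of holes */
};

/* First fit: take the first hole that is large enough. */
struct segment *first_fit(struct segment *head, size_t need)
{
    for (struct segment *s = head; s != NULL; s = s->next)
        if (s->free && s->length >= need)
            return s;   /* caller splits the hole if it is larger than need */
    return NULL;
}
```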
Buddy Algorithm
Considers memory blocks only in power-of-two sizes (2^N).
Potential for fragmentation (drawback).
If no block of a size is available, it splits higher blocks into smaller blocks.
Easy to implement and fast: O(log2(MaxBlockSz/MinBlockSz))
e.g. MaxBlockSz = 4KB, MinBlockSz = 128B: log2(4096/128) = 12 - 7 = 5 splitting steps
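A sketch of the buddy bookkeeping: rounding a request up to a power-of-two order and finding a block's buddy. The MIN_BLOCK of 128B matches the example above; the names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define MIN_BLOCK 128u   /* 2^7, smallest block handed out */

/* Smallest order k such that (MIN_BLOCK << k) >= size. */
static unsigned order_for(size_t size)
{
    unsigned k = 0;
    size_t block = MIN_BLOCK;
    while (block < size) { block <<= 1; k++; }
    return k;
}

/* A block's buddy differs from it only in the bit of its own size:
 * splitting and merging toggle that single bit of the offset. */
static uintptr_t buddy_of(uintptr_t offset, unsigned order)
{
    return offset ^ ((uintptr_t)MIN_BLOCK << order);
}
```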
Examples
What are really the problems?
Memory requirements are unknown for most apps.
Not enough memory: having "enough" physical memory is not possible with current technology, and how would you even determine what is "enough"?
Exploit and enforce one condition: the processor does not execute or access anything that is not in memory (how would that even be possible?). Enforce it transparently (the user is not involved).
Memory Management Techniques
Memory management brings processes into main memory for execution by the processor
involves virtual memory
based on paging and segmentation
But we can see that...
All memory references are logical addresses in a process’s address space that are dynamically translated into physical addresses at run time
An address space may be broken up into a number of pieces that don’t need to be contiguously located in main memory during execution.
So it is not necessary that all of the pieces of an address space be in main memory during execution, only the ones that are "currently" being accessed.
Virtual Memory
Each program has its own address space
This address space is divided into pages (e.g. 4 KB)
Pages are mapped into physical memory chunks (called frames)
By definition then
sizeof(page) == sizeof(frame)
(the size is determined by the hardware manufacturer)
Secondary Storage
Main memory can act as a cache for the secondary storage (disk)
Advantages:
illusion of having more physical memory
program relocation
protection
Page Table Entry
Present bit: '1' if the entry is valid; otherwise the translation is invalid and a page fault exception will be raised.
Frame Number: this is the physical frame that is accessed based on the translation.
Protection bits ('kernel' + 'w'): specify who may access the page and what can be done with it. If the kernel bit is set, only the kernel can use this translation; a user access raises a 'privilege exception'. If the write-protect bit is set, the page can only be read (load instructions); an attempted write (store instruction) raises a write-protection exception.
Reference bit: every time the page is accessed (load or store), the reference bit is set.
Modified bit: every time the page is written to (store), the modified bit is set.
Caching disabled bit: required when accessing memory-mapped I/O devices, otherwise stale device contents could be served from the CPU cache.
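The exact bit layout is architecture specific; the sketch below only illustrates the fields listed above (field names and widths are invented, not a real x86 PTE):

```c
#include <stdint.h>

/* Illustrative 64-bit page table entry. */
struct pte {
    uint64_t present       : 1;   /* translation valid, else page fault   */
    uint64_t writeprotect  : 1;   /* stores raise write-protect exception */
    uint64_t kernel        : 1;   /* user access raises privilege fault   */
    uint64_t referenced    : 1;   /* set by MMU on any load/store (R bit) */
    uint64_t modified      : 1;   /* set by MMU on a store (M/dirty bit)  */
    uint64_t cache_disable : 1;   /* bypass CPU cache (memory-mapped I/O) */
    uint64_t frame_number  : 58;  /* physical frame the page maps to      */
};
```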
Multi-Level Page Table / RadixTree / Hierarchical Page Table
To reduce storage overhead in case of large memories.
For sparse address spaces (most processes) only a few 2nd-level tables are required
4 Level PageTable for 64-bit arch
The OS has to make sure no segment is allocated in the high range of the address space (bits 63-48). If those bits are set, the MMU will raise an exception (which would then really be an OS bug).
General Formula of PageTable Management
The OS must create its page table and VMM management following the hardware definition.
Hardware defines the frame size as 2^N
The (virtual) page size is typically equal to the frame size (otherwise it is a power-of-two multiple, but that is rare and we will not rely on this exception)
Everything the OS manages is based on pagesize (hence framesize)
You can compute all the semantics with some basic variables defined by the hardware:
Virtual address range (e.g. a 48-bit virtual address means the hardware only considers the 48 least-significant bits of a virtual address for ld/st; if any other bits are set, a SEGV error is raised)
Frame size
PTE size (e.g. 4byte or 8byte)
From there you can determine the number of page table hierarchies, offsets
Examples:
Frame size = 4KB → 12-bit offset to index into a frame (12 bits)
PTE size = 8 bytes → 4096 / 8 = 512 entries in each page-table level (9 bits per level)
Virtual address range = 48 bits → 12 + N*9 = 48, so N = 4 levels
The hardware does all the indexing as described before
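A sketch of the index computation for exactly this 4KB / 8-byte-PTE / 48-bit example (done here in software for illustration; in reality the MMU performs the walk in hardware):

```c
#include <stdint.h>

#define OFFSET_BITS   12u   /* 4KB frames                    */
#define BITS_PER_LVL   9u   /* 4096 / 8-byte PTE = 512 = 2^9 */
#define LEVELS         4u   /* 12 + 4*9 = 48 bits            */

/* Split a 48-bit virtual address into its four table indices plus offset. */
static void split_vaddr(uint64_t va, uint64_t idx[LEVELS], uint64_t *offset)
{
    *offset = va & ((1ull << OFFSET_BITS) - 1);
    for (unsigned lvl = 0; lvl < LEVELS; lvl++) {
        unsigned shift = OFFSET_BITS + (LEVELS - 1 - lvl) * BITS_PER_LVL;
        idx[lvl] = (va >> shift) & ((1ull << BITS_PER_LVL) - 1);
    }
}
```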
Speeding Up Paging
Challenges:
Mapping virtual to physical addresses must be fast: we cannot always traverse the page table to get the VA -> PA mapping. Solution: Translation Lookaside Buffer (TLB)
If the address space is large, the page table will be large (but remember the sparsity, it is not fully populated). Solution: multi-level page table
TLB
Observation: most programs tend to make a large number of references to a small number of pages over a period of time -> only a fraction of the page table is heavily used (data and instruction locality!)
TLB is the hardware cache inside the MMU. It caches PageTable translations (VA -> PA) and maps virtual to physical address without going to the page table (unless there’s a miss).
On a TLB miss, the MMU walks the page table and loads the entry from the page table into the TLB. TLB misses occur more frequently than page faults.
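A conceptual sketch of the lookup the MMU performs on every reference, modeled as a tiny fully associative TLB (sizes and field names are invented):

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    bool     valid;
    uint64_t vpn;    /* virtual page number   */
    uint64_t pfn;    /* physical frame number */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true on a TLB hit. On a miss the MMU walks the page table,
 * installs the translation in the TLB, and retries the access. */
static bool tlb_lookup(uint64_t vpn, uint64_t *pfn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pfn = tlb[i].pfn;
            return true;
        }
    }
    return false;   /* miss: page-table walk, then fill an entry */
}
```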
TLB Management
TLB entries are not written back to the page table on every access; the R and M bits are just recorded in the TLB entry upon access (it is a cache after all)
On TLB capacity miss, if the entry is dirty, write it back to its associated PageTableEntry (PTE)
On changes to the PageTable, potential entries in the TLB need to be flushed or invalidated
"TLB invalidate" e.g. when mmap area disappears
"TLB flush" write back when changes in TLB to PageTable (either global or per address) must be recorded, think when a “TLB invalidate” might be insufficient
TLB based translation
Shared Pages (efficient usage of memory)
Shared code:
One copy of read-only (reentrant) code shared among processes (e.g., text editors, compilers, window systems)
Similar to multiple threads sharing the same process space
Also useful for inter-process communication if sharing of read-write pages is allowed
Private code and data:
Each process keeps a separate copy of the code and data
The pages for the private code and data can appear anywhere in the logical address space
Copy on Write (COW)
Copy-on-Write (COW) allows both parent and child processes to initially share the same pages in memory. If either process modifies a shared page, only then is the page copied
COW allows more efficient process creation as only modified pages are copied
In general, free pages are allocated from a pool of zero-fill-on-demand pages
Pool should always have free frames for fast demand page execution: Don’t want to have to free a frame when one is needed for processing on page fault
Why zero-out a page before allocating it? -> so we get a known state of a page.
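A small user-space illustration of the COW semantics: after fork() both processes logically see the same data, but the first write causes a private copy, so the parent's value is unaffected.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int value = 42;               /* lives in a page shared copy-on-write after fork */
    pid_t pid = fork();
    if (pid < 0) { perror("fork"); exit(1); }

    if (pid == 0) {               /* child: this store triggers the page copy */
        value = 99;
        printf("child  sees %d\n", value);   /* 99 */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees %d\n", value);       /* still 42 */
    return 0;
}
```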
vfork()
Variation on fork(): the system call has the parent suspend and the child use the copy-on-write address space of the parent
Designed to have the child call exec()
Very efficient
Page Replacement Algorithms
Optimal Page Replacement Algorithm
Each page is labeled with the number of instructions that will be executed before that page is next referenced.
Page with the highest label should be removed.
Impossible to implement (just used for theoretical evaluations)
The FIFO Replacement Algorithm
OS maintains a list of the pages currently in memory [that would be for instance frame table]
The most recent arrival at the tail
On a page fault, the page at the head is removed
In lab3: for simplicity you implement FIFO as a simple round-robin using a HAND pointer into the frame table
The Second-Chance Page Replacement Algorithm
Modification to FIFO
Inspect the R bit of the oldest page
If R=0 page is old and unused -> replace
If R=1 then
bit is cleared
page is put at the end of the list
the search continues
If all pages have R=1, the algorithm degenerates to FIFO
The Clock Page Replacement Policy
Keep page frames on a circular list in the form of a clock
The hand points to the oldest uninspected page
When page fault occurs
The page pointed to by the hand is inspected
If R=0
page evicted
new page inserted into its place
hand is advanced
If R=1
R is set to 0
hand is advanced
This is essentially an implementation of 2nd-Chance
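A minimal sketch of the clock scan over the frame table (the frame-table structure is hypothetical; this is roughly what lab3's HAND pointer does with second-chance):

```c
#include <stdbool.h>

#define NFRAMES 64

struct frame {
    bool referenced;   /* R bit, maintained together with the PTE */
    int  page;         /* virtual page currently held in this frame */
};

static struct frame frametable[NFRAMES];
static int hand;       /* HAND: index of the oldest uninspected frame */

/* Returns the index of the frame to evict. */
static int clock_select_victim(void)
{
    for (;;) {
        struct frame *f = &frametable[hand];
        if (!f->referenced) {          /* R == 0: evict this one       */
            int victim = hand;
            hand = (hand + 1) % NFRAMES;
            return victim;
        }
        f->referenced = false;         /* R == 1: give a second chance */
        hand = (hand + 1) % NFRAMES;
    }
}
```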
The Clock Page Replacement Policy (an optimization)
When a page fault occurs, the page the hand is pointing to is inspected. The action taken depends on the R bit:
R = 0: evict the page
R = 1: clear R and advance the hand
Optimization: a second hand runs ahead ("in front") and "processes" candidates, optimistically writing modified pages back to swap and clearing their M and R bits
The Not Recently Used (NRU) Replacement Algorithm / Enhanced Second Chance Algorithm
Two status bits with each page
R: Set whenever the page is referenced (used)
M: Set when the page is written / dirty
R and M bits are available in most computers implementing virtual memory
Those bits are updated with each memory reference
Must be updated by hardware
Reset only by the OS
Periodically (e.g. on each clock interrupt) the R bit is cleared
To distinguish pages that have been referenced recently
The Least Recently Used (LRU) Page Replacement Algorithm
Good approximation to optimal. When page fault occurs, replace the page that has been unused for the longest time. Realizable but not cheap.
Slow, and few machines (if any) have the required hardware.
Hardware Implementation 1
64-bit counter increment after each instruction
Each page table entry has a field large enough to include the value of the counter
After each memory reference, the value of the counter is stored in the corresponding page entry
At page fault, the page with lowest value is discarded
Too expensive! Too slow!
Hardware Implementation 2
Machine with n page frames
Hardware maintains a matrix of n x n bits
Matrix initialized to all 0s
Whenever page frame k is referenced:
Set all bits of row k to 1
Set all bits of column k to 0
The row with lowest value is the LRU
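A sketch of the n x n matrix scheme using one bitmask row per frame (assuming n is small enough that a row fits in a uint64_t; in this scheme the update is done by hardware, here shown as C for clarity):

```c
#include <stdint.h>

#define NFRAMES 8                     /* n, small enough for a 64-bit row */

static uint64_t row[NFRAMES];         /* row[k] = bits of matrix row k    */

/* On every reference to page frame k. */
static void reference(int k)
{
    row[k] = (1ull << NFRAMES) - 1;   /* set all bits of row k to 1    */
    for (int i = 0; i < NFRAMES; i++)
        row[i] &= ~(1ull << k);       /* set all bits of column k to 0 */
}

/* The LRU frame is the one whose row has the lowest value. */
static int lru_frame(void)
{
    int lru = 0;
    for (int i = 1; i < NFRAMES; i++)
        if (row[i] < row[lru])
            lru = i;
    return lru;
}
```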
Hardware Implementation 3
Maintain the LRU order of accesses in a frame list, managed by hardware
Example: after accessing page 3, that page is moved to the most-recently-used end of the list
Approximating LRU in Software
Not Frequently Used (NFU) algorithm
Software counter associated with each page, initially zero
At some periodicity (e.g. a second), the OS scans all page table entries and adds the R bit to the counter of that PTE
At page fault: the page with lowest counter is replaced
Still a lot of overhead ( not really practical either)
Aging Algorithm
NFU never forgets anything -> high inertia
Modifications:
shift counter right by 1
add R bit as the leftmost bit
reset R bit
This modified algorithm is called aging. Each bit in the counter vector represents one period. At page replacement, the page whose counter is lowest is replaced.
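A sketch of the aging update done at each clock tick, using 8-bit counters as in the classic illustration (array sizes and names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define NPAGES 1024

static uint8_t age[NPAGES];    /* software counter per page     */
static bool    rbit[NPAGES];   /* R bit harvested from the PTEs */

/* Called periodically (e.g. on every clock interrupt). */
static void age_tick(void)
{
    for (int p = 0; p < NPAGES; p++) {
        age[p] >>= 1;              /* shift counter right by 1        */
        if (rbit[p])
            age[p] |= 0x80;        /* add R bit as the leftmost bit   */
        rbit[p] = false;           /* reset R bit                     */
    }
}

/* At a page fault, the page with the lowest age[] value is the victim. */
```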
The Working Set Model
Working set: the set of pages that a process is currently using
Thrashing: a program causing page faults at high rates (e.g. pagefaults/instructions metric)
OS must keep track of which pages are in the working set
Replacement algorithm: evict pages not in the working set
Possible implementation (but expensive): working set = set of pages accessed in the last k memory references
Approximations: working set = pages used in the last 100 msec or 1 sec (etc.)
Working Set Page Replacement Algorithm
WSClock Page Replacement Algorithm (specific implementation of Working Set Replacement)
Based on the clock algorithm and uses working set
data structure: circular list of page frames (clock)
Each entry contains: time of last use, R bit
At page fault: page pointed by hand is examined
If R = 1:
Record current time, reset R
Advance hand to next page
If R = 0:
If age > threshold and page is clean -> it is reclaimed
If page is dirty -> write to disk is scheduled and hand advances (note in lab3, we skip this step for simplicity reasons, but think why this is advantageous?)
Advance hand to next page
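A sketch of one WSClock inspection step over the circular frame list. The threshold TAU, the timestamp unit, and the structure names are assumptions; as noted above, a dirty old page would only have its writeback scheduled while the scan continues.

```c
#include <stdbool.h>
#include <stdint.h>

#define NFRAMES 64
#define TAU     100   /* working-set window, same unit as last_use */

struct wsframe {
    bool     referenced;   /* R bit                                  */
    bool     dirty;        /* M bit                                  */
    uint64_t last_use;     /* time of last use recorded by the scan  */
};

static struct wsframe frames[NFRAMES];
static int hand;

/* One step of the scan: returns the frame to reclaim, or -1 to keep scanning. */
static int wsclock_step(uint64_t now)
{
    struct wsframe *f = &frames[hand];
    int victim = -1;

    if (f->referenced) {                    /* R == 1: still in the working set */
        f->last_use = now;
        f->referenced = false;
    } else if (now - f->last_use > TAU && !f->dirty) {
        victim = hand;                      /* old and clean -> reclaim */
    }
    /* old but dirty: a disk write would be scheduled here; the hand just advances */
    hand = (hand + 1) % NFRAMES;
    return victim;
}
```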
OS Involvement With Paging
When a new process is created
Determine how large the program and data will be (initially)
Create page table
Allocate space in memory for page table
Record info about page table and swap area in process table
When a process is scheduled for execution
TLB flushed for current process
MMU reset for the process
Process table of next process made current
When process exits
OS releases the process page table
Frees its pages and disk space
When page fault occurs: Page Fault Exception Handling
The hardware:
Saves program counter
Exception leads to kernel entry
An assembly routine saves general registers and calls OS.
OS tries to discover which virtual page is needed
OS checks the address for validity and protection and assigns a page frame (page replacement may be needed)
If page frame selected is dirty
Page scheduled to transfer to disk
Frame marked as busy
OS suspends the current process
Context switch takes place
Once the page frame is clean
OS looks up disk address where needed page is
OS schedules a disk operation
Faulting process still suspended
When the disk interrupt indicates the page has arrived
OS updates page table
Faulting instruction is backed up to its original state before page fault and PC is reset to point to it.
Process is scheduled for execution and OS returns to the assembly routine.
The routine reloads registers and other state information and returns to user space.
Virtual Memory & I/O Interaction (Interesting Scenario)
A process issues a syscall() to read a file into a buffer and is suspended while waiting for the I/O. Another process starts executing and gets a page fault. If the paging algorithm is global, there is a chance the page containing the buffer is removed from memory; the I/O operation of the first process will then write some of its data into the buffer and some onto the just-loaded page!
One solution: Locking (pinning) pages engaged in I/O so that they will not be removed by replacement algo
Backing Store
Swap area: a dedicated disk area with no normal file system on it.
The OS associates with each process the disk address of its swap area; this is stored in the process table.
Before a process starts, its swap area must be initialized.
One way: copy all process image into swap area [static swap area]
Another way: don’t copy anything and let the process swap out [dynamic]
Instead of disk partition, one or more preallocated files within the normal file system can be used [Windows uses this approach.]
Memory Mappings
Each process consists of many memory areas, aka:
segments
regions
VMAs (virtual memory areas)
Examples: heap, stack, code, data, read-only data, etc.
Each has different characteristics
Protection (executable, rw, rdonly)
Fixed, can grow (up or down) [heap, stack]
Each process can have 10s-100s of these.
Memory Mapped Files
Maps a VMA -> file segment (see mmap()'s fd argument)
It’s contiguous
On a page fault, the content is fetched from the file at the appropriate offset into a virtual page of the address space
On "swapout", the page is written back to the file (behavior differs between versions though)
Benefits: can perform load/store operations instead of input/output calls
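A short usage example of memory-mapped file access with POSIX mmap (the file name is just an example; error handling is trimmed):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);   /* example path */
    struct stat st;
    fstat(fd, &st);

    /* Map the whole file; page faults pull its contents in on demand. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte: %d\n", p[0]);      /* plain load instead of read() */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```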
Organization of Memory Regions
Areas are by default non-overlapping; they are called VMAs (Virtual Memory Areas)
Organized as AVL trees
Identify the matching VMA in O(log N) time during a page fault: pgfault(vaddr) → VMA
Rebalanced when a VMA is added or deleted
VMA organization
Organized as balanced tree
Node [start - end]
On add/delete: rebalance
Lookup in O(log(n)) time