Paging In troduction OS -> usually takes one of two approaches when solving any space-management problem chop things chop-up page up into into fixedvariable-sized size pieces pieces Paging Segmentation the Atlas problems: - fragmented allocation hallenging over time

Instead of splitting up a process's address space into some number of variable-sized logical segments (voole, heap, stack) divide it into fixed-size units page Physical memory becomes an array of fixed-size slots called page-frames PA Physical Menory VA Process Page Page frames.

Challenge How to virtualize mennery with pages: avoid segmentation problems - minimal space and time overheads





Most important advantage of paging () flexibility -> support the abstraction effectively -> we won't make assumptions about direction of stack and heap. Simplicity (noybe)
 OS keeps a list of kee
 pages to allocate memory. -> To record where each virtual page of the address space is placed in memory, the OS usually keeps a <u>per-process</u> DS known as <u>page table</u>]

page table Stores address translations for each of the virtual pages of the address space. exception: per process DS inverted page table OS will manage many such tables. move <virtual add">, 7. cax. -> ignore inskuction fetch split it into 1) Virtual Page No. to Kanslate VA 2 offset

e.g. VA Size = 64 bytes  

$$\Rightarrow 6 \text{ bits for VA}$$
  
 $von \quad offset$   
 $b_5 \quad b_4 \quad b_3 \quad b_2 \quad b_1 \quad b_0$   
 $paye \quad size = 16 \quad bytes = 4 \quad bits = offset$   
 $2 \quad bits = select \quad one$   
 $of \quad 4$   
 $pages$   
 $movl \quad 21 \quad , \ \% easo$   
 $2 \quad J$   
 $0 \quad 1 \quad 0 \quad 1 \quad 0 \quad 1$   
 $T$   
 $page \quad 1 \quad use$   
 $page \quad table \rightarrow find ushich$   
 $physical ficance$   
 $v \quad page \quad no \quad 1$   
 $resides \quad inv.$   
 $PFN \quad (Physicid 
Frame #)$ 

0 0 0 VPN Add Trans eg 0 ł 0 Questions 1) Where are the page tables stored 2) What are the typical contents of the page table, and how big are the tables? (3) Can it be slow?

where are the page tables stored Page tables can get torribly large. eg. 32-bit address space 4 KB page size  $4 \text{ KB} = 2^{12} \text{ bytes}$ offset = 12 bits VPN = 20 bits 2° translations ~ million OS would have to manage per process assuming we need 4 bytes per PTE to hold the physical translation + some other stuff 4MB memory per process

What's actually in the Page Table? -> DS to map VPN to PFN -> simplest: linear page table just an array -> indexed by VPN page table [ VPN] ← PFN -> more advanced DS in later Chapters (page-fable entry) Contents of each PTE () Value bit \_\_\_\_\_ > e.g. when if process tries to access a program such address (invalid) I starts running generale trap to the OS unused the heap

ancial for supporting sparse address space -> remove the need to allocate physical frames for unused memory. (2) Protection bits - read write execute traps if different permission shon expected. 3 Present bit > physical memory or disk? (swapped out?) Swapping \_\_\_\_\_ address spaces larger than physical memory.

Swapping allows the OS to free up physical memory by moving rarely used pages to disk. (a) Dirty bit (5) Référence / accessed bit useful ) - to know which pages are Page replacement popular. 6) 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 PFN Figure 18.5: An x86 Page Table Entry (PTE) hadisale cechip user supervisor

valid bit? Why no P 🕝 p=1 => both present and valid may not be present but is valid  $\rho = 0 \implies$ OR not present not valid triggers trap to the OS use add " structure It keeps to determine if the page is valid, or not Illezal memory and thus perhaps should be swapped access back in.

Pagine: Also Too Slow -> Page Tables might be too big -> They can slow things down too. move 21 7. eax physical address 17 first fetch the proper PTE Slides impostant shift \* Executing program -> File to process memory view. \* All processes have some add' space. -> OS performs memory virtualisation with the help of hardware.

\* Responsibilities of OS during program load: -> Create add space -> load binger -> load binary -> Update PCB register state PCB (new) virtual  $PC = D \times D$ to physical promo lation SP = Dx FFFF Shack Veap ; Translation Reg Memory State Memory APIS -> User has no direct control on physical memory.

> no special system stack calls  $\uparrow$   $\uparrow$   $\uparrow$ BSS -> brk, sbrk Heap input void \* address long size Pata (minit) BSS Data (init) Text) make end make of BSS end = addr = end + Size \* NOLL => remap things return old mmap discontiguous allo cation add e.g: allocate 4096 bytes sbrk('o) w/ read + write Im prot La returns the prt = mmap (NULL, 10% PROT\_READ) cursent location of start addr PROT-WRITE, MAP\_ANONY 1 ..., -1, 0 3 BSS flags få ?? offset etext -> end of Text at edata -> end of data (initialized) program load end -> end of BSS boat time -> etext -> end of Text

-> printing adds of fr and Vars: /proc/<pid>/ maps Linux: Example: Memory State of PCB -> Circular list START and END never everlap b/w two segment areas. -> Can be merged/extended if permission matched. (Mem state) 

Inheriting Address Space through fork () ¥ -> child inherit the memory state of the parent S data skucture is copied into child PCB Any change through mmap() os brk() is per-process \* Overriding Address Space through exec exec -> Add is reinitialized using the now executable -> Changes to newly created As depends on the logic of new process

Address Space Granularity \* Translation at address space granularity -> ISA, ×86 nov 7. rcx, 7. var register mor \$5, % rax immediate absolute mor 8000001, 7. 8ax mov. (Torcx), Zrax indirect -16 (7, rbp), 7, rax displacement mor  $\rightarrow$  examples does not know stack -> compiles : address, hence blindly uses the register (rbp and rsp) (rbp and rsp) 1, top of stack base/frame pointes stack year

OS during binary load load\_new\_executable ( PCB \*cutsent, File \*ere)} verify (exe) reinit addr-space (cutsent -> monstate); allocate\_phys\_men (ussent) load-exe-to-phys\_men (current, exe); VAS { set \_ uses \_ sp ( current -> mm\_state -> Stack \_ stast); ( set \_ user \_ pc ( current -> mm\_state -> xde \_ stast) retrien to coser Process State After exec() -> OS configures the base register depending on the physical loc"

PCB CPU user execution state PC = 0PC = OSP = 8KBBase = 20KB SP = 8 KB $T_{Base} = 20 \text{ kB}$ 20KB instrfetch (vaddr = 10) 28KB → instr fetch (paddr = 20KB + 10) "push % rbp" RAM -> Assuming RSP = 8KB "push 7. rbp" -> results in a memory store at addr (8KB - 8) CPU translates the add to (28 KB - 8)

· -> How to stop illegal accesses ? VA size = 8 KB Access = 20 KBphy Add = 40 KB X -> Limit Register -> can be changed from privileged mode raises a fault if the → Hardware program violates the limit OS fault bandles may kill the process. PCB" CPU user exe PC =D PC = 0SP = 878 T-Base = 20 T-limit = 8 SP = 8KBBase = 20KB limit = & KB

: How is memory isolation achieved? -> lîmît register.

Context switch Base and limit register values saved in the ontgoing process FCB during context switch.

-> Loaded from PCB to the CPU when a process is scheduled

Disadvantages of Translation at AS Granulasity -> Physical memory must be

greater stan AS size. -> against abstraction philosophy

-> Menosy inefficient -> Physical memory size is same as adde space size irrespective of actual usage -> Memory wastage

- Degree of multiprogramming is very less-

Segmentation -> Extension of the basic sheme w/mose base - limit register pairs.  $\rightarrow$  Ex: code add<sup>r</sup> data add stack add"

Segmentation : Explicit Addressing -> Past of the code is used to explicitly specify segments. -) 8KB Stack Segment 20KB Stack Base = 21KB 7KB 21KB Limit = 1KB  $\downarrow \downarrow \downarrow \downarrow$ Free 22KB Data Segment <u>† † †</u> 3KB Base = 32KB 23KB Heap Limit = 2KB32KB 2KB Data **Code Segment** 1KB Base = 22KB Text 34KB Limit = 1KB 0 (Instructions 48KB RAM VA = 8 KB; add length = 13 bits; 3 segments -> Two MSBs used to specify the segment: 00 → code 01 -> data  $\parallel \rightarrow \text{stack}$ 

-> hardware selects the segment register based on the value of 2 MSB bits and the rest of the bits ale used as offset. -> Max size of each segment = 2KB -> Physical allocation is still done on an on-denad Disadvantages with explicity addressing -> Inflexible - Data and Stack cannot be sized dynamically -> Wastage of VA space, -> in our example, 2KB VA is musable is musable

Segmentation : Implicit Addressing -> Hardware selects the segment register based on the operation -> Code segment for instruction access -> fetch add<sup>r</sup>, jump target. call add<sup>r</sup> -> Stack segment for stack operations > pash pop. indirect addressing with SP, BP -> Data segment for other addr

Segmentation: (Protection and Direction Flags [Limit] Base Plags RWX enforce iso lation and shaning to calculate H/W add

Segmentation in Reality DTR: descripter table register is used to access descriptos table → # descriptoss depend on auchitectuse -> Separate descriptors for user and kernel mode Flags Limit Flags D DTR CS Base Flags Limit Read DS Direction Privilege Write (+ or -) Base Execute SS Limit Flags CPU Base **Descriptor Table** Advantages of Segmentation -> Easy and efficient addr translation -> Save memory wastage for unused addresses

Disadvartages -> External fragmentation -> Cannot support discontiguous sparse mapping -> allous discontiguous sparse mapping -> Basic îdea : 1) Pastition the address space into fixed size blocks (all it pages) (2) Pastition physical memory in a sinnlar way (call it page frames) 3 OS: page mapping page frames

(2) H/W uses the mapping to panslate VA to PA. -> Paging example (Pages) 32 × 1024  $2^{\circ} 2^{10}$ VA Size = 32 KB, Page size = 256 bytes ~> 2<sup>8</sup> Add<sup>\*</sup> length = 15 bits  $\{0 \times 0 - 0 \times 7FFF\}$ # of pages =  $\frac{32 \text{ KB}}{\frac{1}{4} \text{ KB}}$ = 128 Example : For  $VA = 0 \times 0510$ page number offset = 16 = 5 8 bib 7614 Pose # Offset

Paging example (Page frames)

Physical Add size = 64 KB  $\frac{Add^{r}}{dt} = \frac{16}{bits} \frac{10 \times 0 - 0 \times FFFF}{2}$   $\frac{1}{dt} = \frac{16}{pt} = \frac{16}{256}$ 

e.g: For PA = 0x 1F51 PFN = 31 offset = 81

Page Table Mapping: -> each entry in page table is called a Page - Table Entry (PTE)

Page Table Walk

PTW (vadde V, PTable P) // returns physical address { Entry = P[V>>8] if (Entry. present) return (Entry. PFN<<8) + (V20...EP) Raise Page Fault; 2 Where is page table stored? -> Page table is stored in the RAM . Page table registes (CR 3 in x86) contains the addr.

Structure of PTE reserved 8 bits  $\sim$ PFN [] ) X ) P | A | S | W | P P→ present bit → entry is valid  $W \rightarrow usite bit \rightarrow usite allowed$ S→ privilege Bit, D→ only kernel A→ accessed bit, > set by H/W during walk D -> divry bit -> set by H/W husing wolk X -> execute bit (inskuction fetch) Maximum Size of physical memory → PFN → 8 bits page size = 256 bytes = + KB

 $\therefore Max RAM size = 2^8 \times \frac{1}{4} KB$ = 64 KB Address Translation: Multilevel Paging and TLB -> With increased addr space size, Single level page table entry is not feasible: -> 1 page size -> fragmentation 1 -> Small pages may not be snitable to hold all mapping entries. \* Example: Paging with 32-bit AS -> Sol<sup>n</sup>: Multi-level page table.

Two-level Page Tables (32-bit VA) -> Level -1 page table contains entries pointing to level -2 page table stanctures -> 12 entry contrains PFN + flags 12 bits 4KB block size 10 bits 10 bits U-offset ) Page offset Physical frame 4KB - 12 entry PFN CR3

## -> 4 level page table : 48-bit VA (x86\_64

## 4-Level Page Tables: 48-bit VA (x86\_64)



## 4-Level Page Tables: Example Translation



CR3

H/W translation by repeated access of page table store in physical memory

□ Page table entry: 12 bits LSB is used for access flags

Translation efficiency 4-level page table, how many accesses Consider sum = D;required for (cts = 0; cts < 10; cts + +)for translation sum + = CRAssume no cache 0x20100: mov \$0 %rax; //sum=0 0x20102: mov %rax, (%rbp); 0x20104: mov \$0, %rcx; //ctr=0 0x20106: cmp \$10, %rcx; //ctr<10 0x20109: jge 0x2011f; ← //jump if>= 0x2010f: add %rcx, %rax;← 0x20111: mov %rax, (%rbp);← //sum+=ctr 0x20113: inc %rcx; //ctr++ 0x20115: jmp 0x20106; ← //loop 0x2011f: ret;  $bop = 10 \times 6$ Instruction execution: others = 5 Memory accesses during translation  $= 65 \times 4 = 260$ 

Data/stack access: initialization = 1 loop = 10 Memory access during translation = 11 \* 4 = 44 Pistinct accesses? Assume: stack address range 0x 7FFF000 - 0x 800000 One code page (0x20) and one stack page -> cache these : TLB Ox 7FFF

Paging N/ TLB: Franslation efficiency

TLB PTE Page 0x 20 Dx 750 0×7FF Ox 890

Translate (vaddr V) { PageAddress P = V>>12; TLBENtry = Lookup (P); if (entry. valid) return entry. pte; entry = Page Table Walk (V); Make Entry (entry); return entry. pte; J

TLB -> Translation Lookaside Buffer -> hardware cache -> stores Page to PFN mapping → first miss for inste fefch, all accesses hit Address Translation (TLB + PTW) VA (48 Lits) VA (48 Lits) TLB Logic TLB mit PA (48 bib) TLB insert PT Walk Page Tables

→ Seperate TLBs for instruction an data, multilevel TLBs -> In x86, OS cannot make entries into the TLB directly, it must flush entries \* How is TLB shared across multiple processes? Option 1: flush whole TLB -> performance problem Option 2: Address Space Identifier (ASID) along with each TLB entry to identify the process.

Process (A)

Process (B)

| ASID | Page  | PTE      |
|------|-------|----------|
| A    | 0x200 | 0x200007 |
| В    | 0x200 | 0x300007 |
| А    | 0x201 | 0x205007 |
| В    | 0x201 | 0x305007 |

## TLB

Why is page fault necessary? -> Page fault is required to support memory over-constituent through lazy allocation and swapping.

Pagefault Handling in x86

if (! pte. valid 11 (access == write && ! pte write) || (cpl!=0 && pte.prir==0)){ CR2 = address; error code = pte. valid ) access << 1 | cpl << 2; Raise Page Fault; 3 // simplified cpl = current priviledge level pte. priv = priv level of pte. → error code is pushed into the kernel stack by the Cardware (x&)

OS Fault Handler virhal Handle Page Fault ( UG4 address, UG4 error code) entry = P[V >> 8];if (AddressExists (current -> mm\_state, address) && Access Permitted ( usrent -> mm\_state, error\_code)){ PFN = allocate\_pfn(); install\_pte (address, PFN); return; Z Raise Signal (SIGSEGV); J

Page Fault and Swapping Swapping (swap-out) DRAM SWAP (Hard disk) Allocate PFN(): free PFNs 1 cannot break promise to applications, ... swap out, but which one? Decide using a page replacement policy

DRAM XFF 1 Page Replacement Policy SWAP (Hard disk) Update the present bit of victim entry to O -> page fault Replace PFN with Swap Address

present Befor PAN (v) - flags 1 After Swap Address (v)  $\downarrow$ same flags TV Hard disk Any future translation will result in page fault. Page fault Handler will copy it back from. the swap device

Swap - in : virtual Handle Page Fault (u64 address, u64 error code) entry = P[V >> 8];if (AddressExists (current -> mm\_state, address) && Access Permitted ( current -> mm\_state, error-code)){ PFN = allocate\_pfn(); if ( is\_swapped\_pte (address)) Swap in (getPTE (address), PFN); install\_pte (address, PFN); return; Raise Signal (SIGSEGV);

Page Replacement Objective: Minimize # of page faults (due to swapping) -> 3 parameters -> A given sequence of accesses to virtual pages  $\rightarrow$  # of memory pages (frames) -> page seplacement policies -> Metrics to measure effectiveness -> # of page faults -> Page fault rate -> fraction of memory -> AMAT accesses that result in page fault

Belady's Optimal Algorithm (MIN) -> Strategy: Replace the page that will be referenced after the Congest time. → Example :  $\rightarrow$  # of frames = 3 -> Reference sequence (in temporal order) 1,4,5 1,2,5  $1 \rightarrow 3 \rightarrow 1 \rightarrow 5 \rightarrow 4 \rightarrow 1 \rightarrow 2 \rightarrow 5$   $\underbrace{\text{Yes}}_{\text{Yes}} \text{Yos} \text{No} \underbrace{\text{Yes}}_{3} \underbrace{\text{Replace}}_{3} \text{No} \underbrace{\text{Replace}}_{4} \underbrace{\text{No}}_{4}$  $3 \leftarrow 5 \leftarrow 2 \leftarrow 2$ Yes No No No. # of page faults = 6

Belady's MIN algorithm is proven to be optimal, but impractical as it requires knowledge of future accesses.

FIFO Strategy: Replace the page that is in the memory for the longest fime

Ex: # of frame = 3  $1 \quad 1,3 \quad 1,3 \quad 1,3,5 \quad 4,3,5 \quad 4,1,5 \quad 4,1,2 \quad 5,1,2 \\ 1 \rightarrow 3 \rightarrow 1 \rightarrow 5 \rightarrow 4 \rightarrow 1 \rightarrow 2 \rightarrow 5 \\ \underline{Y_{CS}} \quad \underline{Y_{CS}} \quad N_{\circ} \quad \underline{Y_{CS}} \quad \underline{Y_$  $3 \leftarrow 5 \leftarrow 2 \leftarrow 2$ Yes No No No

8 page faults - 3 cold start

Least Recently Used (LRV) Replace the page that is not referenced for the longest time. Eq: # of frame = 3  $3 \leftarrow 5 \leftarrow 2 \leftarrow 2 \leftarrow 5$ Yes No No No No -> LRM is shown to be useful for workloads with access locality -> approximate using CLOCK