Osdev-Notes

Paging

What is Paging?

Paging is a memory management scheme that introduces the concept of logical addresses (virtual addresses) and virtual memory. On x86_* architectures this is implemented in hardware. Paging adds a layer of translation between virtual and physical addresses, and between virtual and physical address spaces, and provides a few extra features (such as access protection and privilege level protection).

It introduces a few new concepts that are explained below.

Page

A page is a contiguous block of memory whose exact size depends on what the architecture supports. On x86_64 we have page sizes of 4K, 2M and optionally 1G. The smallest page size is also called a page frame, as it represents the smallest unit the memory management unit can work with, and therefore the smallest unit we can work with! Each entry in a page table describes one page.
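As a small illustration of working with these page sizes, the sketch below defines the three sizes and two alignment helpers. The macro and function names are made up for this example and are not part of any existing API.

```c
#include <stdint.h>

/* Page sizes available on x86_64 (1 GiB pages are optional). */
#define PAGE_SIZE_4K 0x1000ULL
#define PAGE_SIZE_2M 0x200000ULL
#define PAGE_SIZE_1G 0x40000000ULL

/* Round an address down to the start of the page that contains it.
 * page_size must be a power of two. */
static inline uint64_t page_align_down(uint64_t addr, uint64_t page_size) {
    return addr & ~(page_size - 1);
}

/* Round an address up to the next page boundary
 * (an already aligned address is returned unchanged). */
static inline uint64_t page_align_up(uint64_t addr, uint64_t page_size) {
    return (addr + page_size - 1) & ~(page_size - 1);
}
```

For example, with 4K pages the address 0x1234 rounds down to 0x1000 and up to 0x2000.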

Page Directories and Tables

These are the basic blocks of paging. Depending on the architecture (and requested page size) there can be a different number of them.

What are those directories and tables? Let’s start from the tables:

A special register, CR3 contains the address of the root page directory. This register has the following format:

Sometimes CR3 (although technically it’s just the data from bits 12+) is referred to as the PDBR, short for Page Directory Base Register.

Virtual (or Logical) Address

A virtual address is what a running program sees. That’s any program: a driver, a user application or the kernel itself.

Sometimes in the kernel a virtual address maps to the same physical address; this scenario is called identity mapping. This is not always the case, though: the same physical address can also be mapped to several different virtual addresses.

A virtual address is usually a composition of entry numbers for each level of tables. The picture below shows, with an example, how address translation works:

Address Translation

The memory page in the picture refers to a physical memory page (the picture above doesn’t refer to any existing hardware paging, it is just an example scenario). Using logical addresses and paging, we can introduce a whole new address space that can be much bigger than the available physical memory.

For example we can have that:

phys(0x123'456) = virt(0xFFF'F234'5235)

Meaning that the virtual address 0xFFFF2345235 refers to the physical address 0x123456.

This mapping is usually achieved through several hierarchical tables, with each item in one level pointing to the next level table. As already mentioned above, a virtual address is a composition of entry numbers for each level of the tables. Now let’s assume, for example, that we have 3-level paging and 32-bit addressing, that the address translation mechanism used is the one in the picture above, and that we have the virtual address below:

virtaddress = 0x2F880120

Looking at the picture above we know that the bits:

We can translate the above address to:

The above example is just an imaginary translation mechanism, we’ll discuss the actual x86_64 4-level paging below. If we are wondering how the first page directory can be accessed, this will be clear later, but the answer is that there is usually a special register that contains the base address of the root page directory (in this example page dir 1).

Paging in Long Mode

In 64-bit mode we have up to 4 levels of page tables. The number depends on the size we want to assign to each page. It’s worth noting that newer CPUs support a feature called la57 (large addressing using 57 bits), which just adds another layer of page tables on top of the existing 4 to allow for a larger address space. It’s a cool feature, but not really required unless we’re using crazy amounts of memory.

There are 3 possible scenarios:

To implement paging, it is strongly recommended to have already implemented interrupts, specifically handling #PF (vector 0xe).

The 4 levels of page directories/tables are:

The number of levels depends on the size of the pages chosen. If we are using 4KiB pages then we will have: PML4, PDPR, PD, PT, while if we go for 2MiB pages we have only PML4, PDPR, PD, and finally 1GiB pages would only use the PML4 and PDPR.

Page Directories and Table Structure

As we have seen earlier in this chapter, when paging is enabled, a virtual address is translated into a set of entry numbers in different tables. In this paragraph we will see the different types available for them on the x86_64 architecture.

But before proceeding with the details let’s see some of the characteristics common between all table/directory types:

The hierarchy of the tables is:

It is important to note that the x86_64 architecture supports mixing page sizes.

In the following paragraphs we will look in more detail at how paging is enabled and at the parts common to all of the entries in these tables, and at what they mean.

Loading the root table and enabling paging

So far we have explained how address translation works; now let’s see how the root table (the PML4 on x86_64) is loaded. This is done by loading the special register CR3, also known as PDBR, which we introduced at the beginning of the chapter. Its content is basically the base address of our PML4 table. This can easily be done with two lines of assembly:

   mov eax, PML4_BASE_ADDRESS
   mov cr3, eax

The first mov is needed because cr3 can be loaded only from another register. Keep in mind that long mode requires paging to be enabled, so the first page tables have to be loaded very early in the boot process. Once paging is enabled we can change the content of cr3 to load a new address space.

This can be done using inline assembly too:

#include <stdint.h>

// Load the physical address of a new root table (PML4) into CR3.
void load_cr3( void* cr3_value ) {
    asm volatile("mov %0, %%cr3" :: "r"((uint64_t)cr3_value) : "memory");
}

The inline assembly syntax will be explained in one of the appendix chapters: C Language Info. The mov into a register is hidden here by the "r" constraint in front of the variable cr3_value; this constraint indicates that the variable's value should be placed in a register.

The bits that we need to set to enable paging in long mode are, in order: PAE (Physical Address Extension), bit 5 in CR4; LME (Long Mode Enable), bit 8 in EFER, which has to be accessed with the rdmsr/wrmsr instructions; and finally PG (Paging), bit 31 in cr0.

Every time we need to change the value of a system register (cr* and similar) we must always read the current value first and then update its content, otherwise we can run into trouble. Finally, the Paging bit must be the last one to be enabled.

Setting those bits must be done only once, at an early stage of the boot process (probably one of the first things we do).
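The read-modify-write order described above can be sketched as follows. To keep the example readable outside a kernel, it works on a plain struct holding copies of the register values; the struct and function names are hypothetical, and in real code each field would be read from and written back to cr4, EFER (MSR 0xC0000080) and cr0 with the corresponding privileged instructions.

```c
#include <stdint.h>

#define CR4_PAE  (1ULL << 5)   /* Physical Address Extension, bit 5 of CR4 */
#define EFER_LME (1ULL << 8)   /* Long Mode Enable, bit 8 of EFER */
#define CR0_PG   (1ULL << 31)  /* Paging, bit 31 of CR0 */

/* Hypothetical snapshot of the current register contents. */
struct control_regs {
    uint64_t cr0;
    uint64_t cr4;
    uint64_t efer;
};

/* OR the new bits into the current values, in the required order,
 * so that every bit that was already set is preserved. */
static inline void apply_long_mode_paging_bits(struct control_regs *regs) {
    regs->cr4  |= CR4_PAE;   /* 1. enable PAE */
    regs->efer |= EFER_LME;  /* 2. enable long mode */
    regs->cr0  |= CR0_PG;    /* 3. enable paging, always last */
}
```

Using `|=` rather than a plain assignment is what keeps the bits that were already set, which is the point of reading the current value first.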

PML4 & PDPR & PD

PML4 and PDPR entry structures are identical, while the PD one has a few differences. Let’s begin by looking at the structure of the first two types:

| 63 | 62 … 59 | 58 … 52 | 51 … 40 | 39 … 12 | 11 … 9 |
|----|---------|---------|---------|---------|--------|
| XD | PK or available | Available | Reserved must be 0 | Table base address | Available |

| 8 … 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|-------|---|---|---|---|---|---|
| Reserved | A | PCD | PWT | U/S | R/W | P |

Where Table base address is a PDPR table base address if the table is PML4 or the PD base address if the table is the PDPR.

Now the Page Directory (PD) has a few differences:

Page Table

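A page table entry structure is still similar to the one above, but it contains a few more bits that can be set:

| 63 | 62 … 59 | 58 … 52 | 51 … 40 | 39 … 12 | 11 … 9 |
|----|---------|---------|---------|---------|--------|
| XD | PK or available | Available | Reserved must be 0 | Page Base Address | Available |

| 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|---|---|---|---|---|---|---|---|---|
| G | PAT | D | A | PCD | PWT | U/S | R/W | P |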

In this table there are 3 new bits (D, PAT, G) and the page base address, as already explained, is not pointing to a table but to the physical memory this page represents.
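The page table entry layout above could be expressed in C roughly as follows. This is only a sketch: the macro and function names are invented for the example, and the address mask covers bits 39..12 exactly as drawn in the table (CPUs with wider physical addresses use more bits).

```c
#include <stdint.h>
#include <stdbool.h>

/* Flag bits of a page table entry, following the table above. */
#define PTE_PRESENT  (1ULL << 0)   /* P   */
#define PTE_WRITE    (1ULL << 1)   /* R/W */
#define PTE_USER     (1ULL << 2)   /* U/S */
#define PTE_PWT      (1ULL << 3)
#define PTE_PCD      (1ULL << 4)
#define PTE_ACCESSED (1ULL << 5)   /* A   */
#define PTE_DIRTY    (1ULL << 6)   /* D   */
#define PTE_PAT      (1ULL << 7)
#define PTE_GLOBAL   (1ULL << 8)   /* G   */
#define PTE_XD       (1ULL << 63)

/* Page base address field: bits 39..12 (assumption taken from the table). */
#define PTE_ADDR_MASK 0x000000FFFFFFF000ULL

/* Build an entry from a page-aligned physical address and a set of flags. */
static inline uint64_t pte_make(uint64_t phys_addr, uint64_t flags) {
    return (phys_addr & PTE_ADDR_MASK) | flags;
}

/* Extract the physical address stored in an entry. */
static inline uint64_t pte_address(uint64_t entry) {
    return entry & PTE_ADDR_MASK;
}

/* Check the P bit. */
static inline bool pte_is_present(uint64_t entry) {
    return (entry & PTE_PRESENT) != 0;
}
```

The same helpers work for PML4, PDPR and PD entries that point to another table, since the address field and the low flag bits sit in the same positions.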

In the next section we will go through the fields of an entry.

Page Table/Directory Entry Fields

Below is a list of all the fields present in the table entries, with an explanation of the most commonly used.

A note about PWT and PCD: the definition of those bits depends on whether PAT (page attribute tables) are in use or not. For a better understanding of those two bits please refer to the most recent Intel documentation (the Paging section of the Intel Software Developer Manual, vol. 3).

Address translation

Address Translation Using 2MB Pages

If we are using 2MB pages this is how the address will be handled by the paging mechanism:

         
| 63 … 48 | 47 … 39 | 38 … 30 | 29 … 21 | 20 … 0 |
|---------|---------|---------|---------|--------|
| 1 … 1 | 1 … 1 | 1 … 0 | 0 … 0 | 0 … 0 |
| Sgn. ext | PML4 | PDPR | Page dir | Offset |

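Every table has 512 elements, so we have an address space of $512 \times 512 \times 512 \times \mathtt{0x200000}$ bytes (0x200000 being the page size), that is $2^{9} \cdot 2^{9} \cdot 2^{9} \cdot 2^{21} = 2^{48}$ bytes.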
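The 2MB layout above can be turned into code with a few shifts and masks. The following is a minimal sketch; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Components of a virtual address when it is mapped by a 2MB page. */
struct vaddr_2m {
    uint16_t pml4;    /* bits 47..39 */
    uint16_t pdpr;    /* bits 38..30 */
    uint16_t pd;      /* bits 29..21 */
    uint32_t offset;  /* bits 20..0, offset inside the 2MB page */
};

static inline struct vaddr_2m split_vaddr_2m(uint64_t vaddr) {
    struct vaddr_2m v;
    v.pml4   = (uint16_t)((vaddr >> 39) & 0x1FF);  /* 9 bits: 512 entries */
    v.pdpr   = (uint16_t)((vaddr >> 30) & 0x1FF);
    v.pd     = (uint16_t)((vaddr >> 21) & 0x1FF);
    v.offset = (uint32_t)(vaddr & 0x1FFFFF);       /* 21 bits: 2MB */
    return v;
}
```

For instance, the address 0xFFFFFFFF80212345 splits into PML4 entry 511, PDPR entry 510, PD entry 1 and offset 0x12345.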

Address translation Using 4KB Pages

If we are using 4kB pages this is how the address will be handled by the paging mechanism:

           
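| 63 … 48 | 47 … 39 | 38 … 30 | 29 … 21 | 20 … 12 | 11 … 0 |
|---------|---------|---------|---------|---------|--------|
| 1 … 1 | 1 … 1 | 1 … 0 | 0 … 0 | 0 … 0 | 0 … 0 |
| Sgn. ext | PML4 | PDPR | Page dir | Page Table | Offset |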

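Same as above: every table has 512 elements, so we have an address space of $512 \times 512 \times 512 \times 512 \times \mathtt{0x1000}$ bytes (0x1000 being the page size), that is $2^{9} \cdot 2^{9} \cdot 2^{9} \cdot 2^{9} \cdot 2^{12} = 2^{48}$ bytes.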
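Since every level uses 9 bits and the 4KB offset uses 12, the index for any level can be computed with one formula. The sketch below numbers the levels from 1 (page table) to 4 (PML4); that numbering and the function names are choices made for this example only.

```c
#include <stdint.h>

/* Index into the table at the given level for a 4KB-page translation.
 * level 4 = PML4, 3 = PDPR, 2 = page directory, 1 = page table. */
static inline unsigned vaddr_index(uint64_t vaddr, unsigned level) {
    unsigned shift = 12 + 9 * (level - 1);   /* 12, 21, 30 or 39 */
    return (unsigned)((vaddr >> shift) & 0x1FF);
}

/* Offset inside the 4KB page: bits 11..0. */
static inline unsigned vaddr_page_offset(uint64_t vaddr) {
    return (unsigned)(vaddr & 0xFFF);
}
```

As a usage example, 0xFFFF800000001ABC yields index 256 at level 4, index 1 at level 1 and a page offset of 0xABC.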

Page Fault

A page fault (exception 14, which triggers the interrupt of the same number) is raised when address translation fails for any reason. An error code describing the situation when the fault occurred is pushed onto the stack before the interrupt handler is called. Note that these bits describe what was happening, not why the fault occurred. If the user bit is set, it does not necessarily mean it was a privilege violation. The CR2 register also contains the address that caused the fault.

The idea of the page fault handler is to look at the error code and faulting address, and do one of several things:

The error code has the following structure:

           
| 31 … 5 | 4 | 3 | 2 | 1 | 0 |
|--------|---|---|---|---|---|
| Reserved | I/D | RSVD | U/S | W/R | P |

The meanings of these bits are expanded below:
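A handler might decode the error code along these lines. This is a sketch with hypothetical names, covering only the five low bits shown in the table; reading the faulting address from CR2 is left out because it needs a privileged instruction.

```c
#include <stdint.h>
#include <stdbool.h>

#define PF_PRESENT (1u << 0)  /* P:    0 = page not present, 1 = protection violation */
#define PF_WRITE   (1u << 1)  /* W/R:  0 = read access, 1 = write access */
#define PF_USER    (1u << 2)  /* U/S:  1 = the access happened in user mode */
#define PF_RSVD    (1u << 3)  /* RSVD: 1 = a reserved bit was set in a paging entry */
#define PF_IFETCH  (1u << 4)  /* I/D:  1 = the access was an instruction fetch */

/* Decoded view of the page fault error code. */
struct pf_info {
    bool present;
    bool write;
    bool user;
    bool reserved_bit;
    bool instruction_fetch;
};

static inline struct pf_info pf_decode(uint64_t error_code) {
    struct pf_info info;
    info.present           = (error_code & PF_PRESENT) != 0;
    info.write             = (error_code & PF_WRITE) != 0;
    info.user              = (error_code & PF_USER) != 0;
    info.reserved_bit      = (error_code & PF_RSVD) != 0;
    info.instruction_fetch = (error_code & PF_IFETCH) != 0;
    return info;
}
```

An error code of 0x6, for example, describes a write from user mode to a page that is not present.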

Accessing Page Tables and Physical Memory

Recursive Paging

One of the problems we face once paging is enabled is how to access the page directories and tables when we need to, especially when we need to map a new physical address.

There are two ways to achieve it:

To use recursion, the only thing we need to do is reserve an entry in the root page directory (PML4 in our case) and make its base address point to the directory itself.

A good idea is to pick an index high enough that it will not interfere with other special kernel/hardware addresses. For example, let’s use entry 510 for the recursive item.

Creating the self reference is pretty straightforward, we just need to use the directory physical address as the base address for the entry being created:

pml4[510l] = pml4_physical_address | PRESENT | WRITE;

Again, this should be done when setting up paging, in the early boot stages.

As we have seen above, address translation splits the virtual address into entry numbers for the different tables, starting from the leftmost (the root). So if we have, for example, the following address:

virt_addr = 0xff7f80005000

The entries in this address are: 510 for PML4, 510 for PDPR, 0 for PD and 5 for PT (we are using 4k pages for this example). On real hardware the address must also be sign-extended to its canonical form, 0xffffff7f80005000. Now let’s see what happens from the point of view of the address translation:

This means that by carefully using the recursive item from PML4 we can access all the tables.
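The virtual addresses that reach each table through the recursive entry can be computed rather than worked out by hand. The sketch below assumes the recursive item lives in PML4 entry 510, as chosen above; all function names are hypothetical.

```c
#include <stdint.h>

#define RECURSIVE_INDEX 510ULL  /* PML4 entry that points back to the PML4 */

/* Sign-extend a 48-bit address into its canonical 64-bit form. */
static inline uint64_t canonical(uint64_t addr48) {
    return (addr48 & (1ULL << 47)) ? (addr48 | 0xFFFF000000000000ULL) : addr48;
}

/* Compose a virtual address from the four table indices (4KB pages). */
static inline uint64_t vaddr_from_indices(uint64_t pml4, uint64_t pdpr,
                                          uint64_t pd, uint64_t pt) {
    return canonical((pml4 << 39) | (pdpr << 30) | (pd << 21) | (pt << 12));
}

/* Address at which the page table selected by (pml4, pdpr, pd) is visible:
 * one trip through the recursive entry. */
static inline uint64_t recursive_pt_addr(uint64_t pml4, uint64_t pdpr, uint64_t pd) {
    return vaddr_from_indices(RECURSIVE_INDEX, pml4, pdpr, pd);
}

/* Address at which the page directory selected by (pml4, pdpr) is visible:
 * two trips through the recursive entry. */
static inline uint64_t recursive_pd_addr(uint64_t pml4, uint64_t pdpr) {
    return vaddr_from_indices(RECURSIVE_INDEX, RECURSIVE_INDEX, pml4, pdpr);
}

/* Address of the PML4 itself: four trips through the recursive entry. */
static inline uint64_t recursive_pml4_addr(void) {
    return vaddr_from_indices(RECURSIVE_INDEX, RECURSIVE_INDEX,
                              RECURSIVE_INDEX, RECURSIVE_INDEX);
}
```

With these helpers, `recursive_pd_addr(0, 5)` gives 0xffffff7f80005000, the sign-extended form of the example address used above.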

A few more examples of address translation:

This technique makes it easy to access page tables in the current address space, but it falls apart for accessing data in other address spaces. For that purpose, we’ll need to either use a different technique or switch to that address space, which can be quite costly.

Direct Map

Another technique for modifying page tables is a ‘direct map’ (similar to an identity map). As we know, an identity map is when a page’s physical address is the same as its virtual address, and we could describe it as: paddr = vaddr. A direct map is sometimes referred to as an offset map because it introduces an offset, which gives us some flexibility. We’re going to have a global variable containing the offset for our map, called dmap_base. Typically we’ll set this to some address in the higher half so that the lower half of the address space is completely free for userspace programs. This also makes other parts of the kernel easier later on.

How does the direct map actually work though? It’s simple enough, we just map all of physical memory at the same virtual address plus the dmap_base offset: paddr = vaddr - dmap_base. Now in order to access a physical page (from our PMM for example) we just add dmap_base to it and we can read and write to it as normal.

The direct map does require a one-time setup early in your kernel, as you do need to map all usable physical memory starting at dmap_base. This is no more work than creating an identity map though.

What address should you use for the base address of the direct map? Well, you can put it at the lowest address in the higher half, which depends on how many levels of page tables you have. For 4-level paging this will be 0xffff'8000'0000'0000.
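The two conversions a direct map needs can be sketched as below, using the dmap_base variable described above with the 4-level paging value. The helper names phys_to_virt and virt_to_phys are illustrative, not a fixed API, and the pointers are only valid once all physical memory has actually been mapped at dmap_base.

```c
#include <stdint.h>

/* Base of the direct map: lowest higher-half address with 4-level paging. */
uint64_t dmap_base = 0xFFFF800000000000ULL;

/* vaddr = paddr + dmap_base */
static inline void *phys_to_virt(uint64_t paddr) {
    return (void *)(uintptr_t)(paddr + dmap_base);
}

/* paddr = vaddr - dmap_base */
static inline uint64_t virt_to_phys(const void *vaddr) {
    return (uint64_t)(uintptr_t)vaddr - dmap_base;
}
```

A physical page at 0x1000 handed out by the PMM would then be reachable at 0xffff800000001000.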

While recursive paging only requires a single page table entry at the highest level, a direct map consumes a decent chunk of address space. A direct map is also more flexible, as it allows the kernel to access arbitrary parts of physical memory as needed. Direct mapping is only really possible in 64-bit kernels thanks to the large address space available; 32-bit kernels should opt for recursive mapping to reduce the amount of address space used.

The real potential of this technique becomes apparent when we have multiple address spaces to handle and the kernel needs to update data in different address spaces (especially the paging data structures). In this case, using the direct map, it can access any data in any address space by knowing only its physical address. It will also help when we start to work on device drivers (out of the scope of this book), where the kernel may need to access DMA buffers, which are identified by their physical addresses.

Troubleshooting

There are a few things to take into account when trying to access paging structures using the recursion technique on the x86_64 architecture:

 warning: result of ‘510 << 30’ requires 40 bits to represent, but ‘int’ only has 32 bits
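This warning appears because an unsuffixed literal such as 510 has type int, so the shift is evaluated in 32 bits. One way to avoid it, shown here as a sketch with a hypothetical function name, is to make the left operand 64 bits wide before shifting.

```c
#include <stdint.h>

/* Shift a PDPR index into bits 38..30 of a virtual address.
 * Because index is a uint64_t, the shift is performed in 64 bits;
 * with a literal, the equivalent is writing 510ULL << 30 instead of 510 << 30. */
static inline uint64_t pdpr_index_bits(uint64_t index) {
    return index << 30;
}
```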