This is an implementation of a simple 64-bit kernel for x86 processors. It's a small project to learn how operating systems work and to understand recent security vulnerabilities such as Meltdown, RIDL and Foreshadow.
- Memory management
  - Buddy allocator, slab allocator
  - Paging
  - Virtual memory areas (VMA)
  - Transparent Huge Pages (THP)
  - Page canaries, use-after-free and double-free detection
  - User/kernel address space separation
  - RIDL mitigation
  - Foreshadow mitigation
  - KPTI (todo)
  - ASLR (todo)
  - Demand paging
- Process management
  - Interrupts
  - Preemptive scheduling
  - Syscalls
  - vdso (todo)
  - Multiple processes
  - User and kernel threads
  - Loading ELF binaries
- Scheduling
  - Round-robin scheduler
  - IPC messaging
  - Interrupts
- Multi-core
  - Core pinning
  - Kernel protected with locks
- Swapping
  - OOM killer
  - Background daemon for swapping
This section gives a short overview of our design by describing how the two components, the OOM killer and swapping, interact, before describing these subsystems and their data structures in more detail.
The OOM killer and swapping are part of the memory management subsystem and are only fully active when memory is scarce. In our kernel they are used both synchronously and asynchronously.
Synchronously: If the buddy allocator is unable to allocate a (huge) page because of memory shortage, it synchronously swaps out pages until the (huge) page can be allocated again. If there is not enough swap space to satisfy the allocation, buddy synchronously calls the OOM killer, which calculates the badness score for all existing processes and kills the one with the highest score. After that task's resources have been freed, buddy should be able to allocate the page again.
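A minimal sketch of this synchronous fallback path is shown below; the helper names (`try_alloc_pages`, `swap_out_one_page`, `oom_kill_worst_task`) are illustrative assumptions, not our kernel's actual API.

```c
#include <stddef.h>

/* Illustrative prototypes -- the real buddy/swap/OOM interfaces differ. */
extern void *try_alloc_pages(unsigned int order);   /* NULL if out of memory     */
extern int   swap_out_one_page(void);               /* -1 if swap space is full  */
extern void  oom_kill_worst_task(void);             /* kill highest badness task */

/* Allocate a (huge) page, reclaiming memory synchronously if necessary. */
void *page_alloc_or_reclaim(unsigned int order)
{
    void *page;

    while ((page = try_alloc_pages(order)) == NULL) {
        /* First try to free memory by swapping a page out to disk. */
        if (swap_out_one_page() == 0)
            continue;

        /* Swap space is exhausted too: invoke the OOM killer and retry. */
        oom_kill_worst_task();
    }
    return page;
}
```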
Asynchronously: When booting our kernel we also start a kernel task `swapd`, which only takes action if the free memory drops below 20%. Even then it only runs in intervals of `SWAPD_SCHEDULE_TIME_BLOCK`. While free memory is below 20%, `swapd` keeps track of the LRU list and updates it regularly according to the `PAGE_ACCESSED` bit in the page table entries (PTEs) where each page is mapped. Once the free memory drops below 10% it additionally starts swapping out pages and thereby tries to keep at least 10% of memory free.
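The following sketch shows the general shape of the `swapd` loop under these thresholds; `free_memory_percent`, `lru_update_from_accessed_bits`, `swap_out_one_page` and `sleep_ticks` are hypothetical helpers, and the value of `SWAPD_SCHEDULE_TIME_BLOCK` is a placeholder.

```c
/* Hypothetical helpers used only for this sketch. */
extern unsigned int free_memory_percent(void);
extern void lru_update_from_accessed_bits(void);   /* refresh LRU from PAGE_ACCESSED */
extern int  swap_out_one_page(void);               /* -1 if no swap space left       */
extern void sleep_ticks(unsigned long ticks);

#define SWAPD_SCHEDULE_TIME_BLOCK 100UL            /* placeholder interval */

/* Kernel task started at boot; only acts when free memory becomes scarce. */
void swapd_main(void)
{
    for (;;) {
        sleep_ticks(SWAPD_SCHEDULE_TIME_BLOCK);

        if (free_memory_percent() >= 20)
            continue;                              /* enough memory, nothing to do */

        /* Below 20%: keep the LRU list up to date. */
        lru_update_from_accessed_bits();

        /* Below 10%: also swap pages out until 10% is free again. */
        while (free_memory_percent() < 10 && swap_out_one_page() == 0)
            ;
    }
}
```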
The design of the OOM killer is straightforward. It is only called synchronously from buddy when there is no physical memory available and we are out of swap space. It calculates the badness score for each user task as follows: active physical pages * 2 + swapped pages (+ 5000 if the task is a zombie). It then prints the scores, chooses the task with the highest badness score and kills it, effectively freeing all of its resources including user-allocated physical pages, page tables and swap space. On success it returns control to buddy, which in turn should then be able to allocate a page. If we happen to kill the current task (because it had the highest badness score), buddy simply yields so that another task can be executed.
To keep the badness calculation cheap, the kernel keeps the counters for active physical pages and used swap space in the `task` struct up to date whenever a page is allocated, swapped in/out or deallocated.
- counting of physical pages and swapped pages is maintained per task, so calculating badness scores is very fast
- easy to understand (and compute) badness score: active physical pages * 2 + swapped pages (see the sketch below)
- zombies get a bump of 5000 to make them very likely to be reaped first
- does not take page tables into account
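A sketch of the badness computation under these rules; the `struct task` field names are assumptions based on the per-task counters described above.

```c
/* Assumed minimal task layout; the real struct carries many more fields. */
struct task {
    unsigned long active_pages;    /* physical pages currently mapped */
    unsigned long swapped_pages;   /* pages of this task held in swap */
    int           is_zombie;       /* exited but not yet reaped       */
};

/* badness = active physical pages * 2 + swapped pages (+ 5000 if zombie) */
static unsigned long badness(const struct task *t)
{
    unsigned long score = t->active_pages * 2 + t->swapped_pages;

    if (t->is_zombie)
        score += 5000;   /* zombies should be reaped first */

    return score;
}
```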
The circular linked list `lru_pages` keeps track of all user pages in a least-recently-used manner. It is regularly updated by `swapd` according to the `PAGE_ACCESSED` bit in the PTEs where each page is mapped. When a page needs to be swapped out, the second-chance (CLOCK) algorithm is used to efficiently determine the next page to evict.
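The sketch below illustrates second-chance victim selection over the circular `lru_pages` list; the `page_info` fields used here (`lru_next`, `accessed`) are assumptions for illustration only.

```c
/* Assumed subset of page_info relevant to the CLOCK algorithm. */
struct page_info {
    struct page_info *lru_next;   /* next entry in the circular lru_pages list */
    int               accessed;   /* mirror of PAGE_ACCESSED from the PTEs     */
};

/* Second-chance / CLOCK: skip recently used pages once, then evict. */
struct page_info *clock_pick_victim(struct page_info **clock_hand)
{
    for (;;) {
        struct page_info *p = *clock_hand;
        *clock_hand = p->lru_next;        /* advance the clock hand */

        if (p->accessed) {
            p->accessed = 0;              /* give the page a second chance */
            continue;
        }
        return p;                         /* not recently used: swap it out */
    }
}
```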
In our design, the reverse mapping `rmap` is implemented as an object that maintains a list of pointers to `rmap_elem`s. Each `rmap_elem` links a page table entry, an `rmap` and a task, as shown in Figure 1. This approach allows us to replace the contents of all page table entries that point to the same physical page when that page is being swapped out.
Figure 1. Reverse mapping design
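An illustrative C layout of these structures is given below; the field names (`elems`, `next`, `entry`, `task`, `pp_rmap`) follow the text, but the exact layout is an assumption.

```c
#include <stdint.h>

struct task;   /* owner of an address space (pml4) */

/* One mapping of a physical page: which PTE, in whose address space. */
struct rmap_elem {
    struct rmap_elem *next;    /* next mapping of the same physical page */
    uint64_t         *entry;   /* direct pointer to the page table entry */
    struct task      *task;    /* task whose pml4 contains that entry    */
};

/* Reverse mapping object attached to a physical page. */
struct rmap {
    struct rmap_elem *elems;   /* all PTEs that reference this page */
};

/* The page_info of each physical page carries a pointer to its rmap. */
struct page_info {
    struct rmap *pp_rmap;
};
```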
The swap out procedure is performed in the following way:
- The `swap_out` procedure knows the `page_info` it should transfer to disk. First, it extracts the `rmap` object from the `page_info` and iterates over all elements of its `elems` list.
- For each `rmap_elem` we update the page table entry in the corresponding `pml4` structure; a direct pointer to the entry is stored in the `entry` field of the `rmap_elem`. We clear the `PAGE_PRESENT` bit, set the `PAGE_SWAP` bit in the page table entry, and replace the current physical address of the page with an offset into the global array of the disk representation. The design of the global array is shown below.
Figure 2. Global array of disk representation
Below, we present an example (Figure 3) that shows swapping out a page that was referenced by many page table entries of different tasks. When swapping out a page, we can easily identify all the `rmap_elem`s, as they are reachable through the `pp_rmap` field of `page_info`, and can therefore easily update the necessary page table entries.
Figure 3. Swap out a page that was held by many page table entries
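A hedged sketch of `swap_out` following these steps, reusing the structures from the sketch after Figure 1; `disk_slot_alloc`, `disk_write_page`, `page_to_virt`, `tlb_flush_task`, `page_free` and the bit/mask values are assumptions, while `PAGE_PRESENT` and `PAGE_SWAP` are the flags named above.

```c
#include <stdint.h>

/* Hypothetical helpers used only for this sketch. */
extern uint64_t disk_slot_alloc(void);               /* page-aligned offset in the disk array */
extern void     disk_write_page(uint64_t slot, void *src);
extern void    *page_to_virt(struct page_info *pi);
extern void     tlb_flush_task(struct task *t);
extern void     page_free(struct page_info *pi);

#define PAGE_PRESENT   (1ULL << 0)
#define PAGE_SWAP      (1ULL << 9)          /* placeholder: a software-available PTE bit */
#define PTE_ADDR_MASK  0x000ffffffffff000ULL

static int swap_out(struct page_info *pi)
{
    uint64_t slot = disk_slot_alloc();
    struct rmap_elem *e;

    /* Copy the page contents to its slot in the disk representation. */
    disk_write_page(slot, page_to_virt(pi));

    /* Redirect every PTE that maps this page to the swap slot. */
    for (e = pi->pp_rmap->elems; e != NULL; e = e->next) {
        uint64_t pte = *e->entry;
        pte &= ~(PAGE_PRESENT | PTE_ADDR_MASK);   /* page is no longer in RAM          */
        pte |= PAGE_SWAP | (slot & PTE_ADDR_MASK);/* remember where it lives on disk   */
        *e->entry = pte;
        tlb_flush_task(e->task);                  /* stale translations must be purged */
    }

    page_free(pi);                                /* the physical frame is reusable    */
    return 0;
}
```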
The swap-in mechanism starts from the page table entry that has the `PAGE_SWAP` bit set. We extract the offset into the global disk representation and retrieve the pointer to the `rmap`. Then we allocate a new page with `page_alloc` and copy the data back from the disk. Finally, we find all the page table entries that need to be updated by iterating over the `elems` list of the `rmap`.
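A matching sketch of the swap-in path, using the same assumed structures and masks as the swap-out sketch above; `disk_rmap_lookup`, `disk_read_page`, `disk_slot_free`, `page_alloc` and `page_to_phys` are hypothetical helpers.

```c
extern struct rmap      *disk_rmap_lookup(uint64_t slot);  /* rmap stored with the slot */
extern struct page_info *page_alloc(void);
extern uint64_t          page_to_phys(struct page_info *pi);
extern void              disk_read_page(uint64_t slot, void *dst);
extern void              disk_slot_free(uint64_t slot);

static void swap_in(uint64_t *faulting_entry)
{
    uint64_t slot = *faulting_entry & PTE_ADDR_MASK;   /* offset into the disk array */
    struct rmap *rm = disk_rmap_lookup(slot);
    struct page_info *pi = page_alloc();               /* fresh physical page */
    struct rmap_elem *e;

    disk_read_page(slot, page_to_virt(pi));            /* copy the data back from disk */
    pi->pp_rmap = rm;

    /* Point every PTE that referenced the slot back at the new page.
     * (Flag handling is simplified: only PAGE_PRESENT is restored here.) */
    for (e = rm->elems; e != NULL; e = e->next)
        *e->entry = page_to_phys(pi) | PAGE_PRESENT;

    disk_slot_free(slot);                              /* the disk slot can be reused */
}
```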
Disk operations are by far the most expensive part of swapping in and out. Therefore, whenever a task needs to wait for the disk device, we interrupt the task and move it to the waiting queue. In our codebase the user can turn this asynchronous behaviour on or off by specifying the relevant flag, `SWAP_SYNC_BACKGROUND` or `SWAP_SYNC_DIRECT`.
Our implementation manages a swapped-out page as if it were still in RAM. Whenever we need to increment or decrement the reference counter of a swapped-out page, we perform the operation on the swapped-out data. When the reference counter drops to 0, we remove the page directly from the disk.
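As a small illustration, dropping a reference on a swapped-out page might look like the sketch below; the `swap_slot` struct and the `disk_slot_free` helper are hypothetical.

```c
#include <stdint.h>

/* Hypothetical descriptor for a page that currently lives on disk. */
struct swap_slot {
    uint64_t      offset;     /* position in the global disk array */
    unsigned long refcount;   /* same counter the page had in RAM  */
};

extern void disk_slot_free(uint64_t offset);

/* Drop one reference; when nobody maps the page any more, free the disk slot. */
static void swapped_page_put(struct swap_slot *s)
{
    if (--s->refcount == 0)
        disk_slot_free(s->offset);   /* remove the page directly from the disk */
}
```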
To support multicore operations, we use locks for synchronisation. In our implementation we have 3 locks for managing different critical sections.
- Reverse mapping lock: each `rmap` object, which links to a specific `page_info` object, has a lock that is taken whenever any operation involves this page.
- Disk device lock: since access to the disk should be granted to only one execution context at a time, this lock prevents other cores from using the disk interface concurrently.
- Task lock: when swapping out a page shared by two different processes, we need to ensure that the task that did not trigger the swap also flushes the stale entry from its TLB.
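As a sketch of how the first two locks might nest around a swap-out, assuming spinlock helpers `spin_lock`/`spin_unlock`, a global `disk_lock` and a per-`rmap` `lock` field added to the struct sketch above; the ordering shown is purely illustrative.

```c
/* Assumed minimal spinlock; the real kernel has its own primitive. */
struct spinlock { volatile int locked; };

extern void spin_lock(struct spinlock *l);
extern void spin_unlock(struct spinlock *l);
extern struct spinlock disk_lock;            /* serializes access to the disk device */
extern int  swap_out(struct page_info *pi);  /* as sketched earlier */

static int swap_out_locked(struct page_info *pi)
{
    int ret;

    spin_lock(&pi->pp_rmap->lock);   /* reverse mapping lock: one operation per page */
    spin_lock(&disk_lock);           /* disk device lock: one core drives the disk   */

    ret = swap_out(pi);              /* updates PTEs and writes the page to disk     */

    spin_unlock(&disk_lock);
    spin_unlock(&pi->pp_rmap->lock);
    return ret;
}
```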
Our kernel supports swapping out and in of both transparent huge pages and user huge pages.