Chapter 16 The Page Cache and Page Writeback #80

jason--liu · 2021-07-18T08:04:26Z

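The Linux kernel implements a disk cache called the page cache. Its goal is to minimize disk I/O by keeping in physical memory data that would otherwise require a disk access.
Why have a page cache? 1. Memory access is far faster than disk access. 2. Temporal and spatial locality.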

Approaches to Caching

The page cache consists of physical pages in RAM whose contents correspond to physical blocks on disk.

For example, when a process issues the read() system call, the kernel first checks whether the requisite data is in the page cache. If it is, the kernel can forgo accessing the disk and read the data directly out of RAM. This is called a cache hit. If the data is not in the cache, called a cache miss, the kernel must schedule block I/O operations to read the data off the disk. After the data is read off the disk, the kernel populates the page cache with the data so that any subsequent reads can occur out of the cache. Entire files need not be cached.

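Write Caching

The first approach, no-write, does not cache writes at all. It is rarely used, because it not only fails to cache write operations but also invalidates the cached copy, which makes writes costly.
In the second approach, a write automatically updates both the in-memory cache and the on-disk file. This is called write-through. Its benefit is that it keeps the cache coherent (synchronized and valid with respect to the backing store) without needing to invalidate it. It is also simple.
The third approach, and the one Linux uses, is write-back.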

In a write-back cache, processes perform write operations directly into the page cache. The backing store is not immediately or directly updated. Instead, the written-to pages in the page cache are marked as dirty and are added to a dirty list.

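Periodically, pages on the dirty list are written back to disk in a process called writeback, bringing the on-disk copy in line with the in-memory cache.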
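The read path and the write-back policy described above can be sketched in user space. This is a toy model under stated assumptions, not kernel code: the `disk` array, the `cached_page` structure, and the helpers `get_page()`, `cache_write()`, and `writeback()` are all hypothetical names invented for illustration, and the "dirty list" is reduced to a per-page flag.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 16   /* toy page size, far smaller than a real page */
#define NPAGES  4    /* pages in the toy backing store */

/* Hypothetical backing store standing in for the disk. */
static char disk[NPAGES][PAGE_SZ];

struct cached_page {
	bool present;   /* page is in the cache */
	bool dirty;     /* cached copy is newer than the disk copy */
	char data[PAGE_SZ];
};

static struct cached_page cache[NPAGES];
static unsigned long disk_reads, disk_writes;

/* Return the cached page, reading it from "disk" on a cache miss. */
static struct cached_page *get_page(size_t index)
{
	struct cached_page *p = &cache[index];

	if (!p->present) {                 /* cache miss: go to the disk */
		memcpy(p->data, disk[index], PAGE_SZ);
		p->present = true;
		disk_reads++;
	}
	return p;                          /* later calls are cache hits */
}

/* Write-back: modify only the cached copy and mark it dirty. */
static void cache_write(size_t index, const char *buf, size_t len)
{
	struct cached_page *p = get_page(index);

	memcpy(p->data, buf, len < PAGE_SZ ? len : PAGE_SZ);
	p->dirty = true;
}

/* Writeback: flush every dirty page; returns the number written. */
static int writeback(void)
{
	int written = 0;

	for (size_t i = 0; i < NPAGES; i++) {
		if (cache[i].present && cache[i].dirty) {
			memcpy(disk[i], cache[i].data, PAGE_SZ);
			cache[i].dirty = false;
			disk_writes++;
			written++;
		}
	}
	return written;
}
```

After `cache_write(1, "abc", 4)` the toy disk is still unchanged and the page is dirty; only a later `writeback()` call copies the data out, which mirrors the deferred update that distinguishes write-back from write-through.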

Cache Eviction

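The final piece of caching is the process by which data is removed from the cache, either to make room for more relevant cache entries or to shrink the cache so that more RAM is available for other uses.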

Least Recently Used

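Cache eviction strategies try to approximate the clairvoyant algorithm using only the information available to them. An LRU eviction strategy tracks when each page is accessed (or at least keeps the list of pages sorted by access time) and evicts the pages with the oldest timestamp (or those at the start of the sorted list). One particular failure of LRU, however, is that many files are accessed once and then never again. Putting them at the top of the LRU list is therefore not optimal.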

The Two-List Strategy

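Linux therefore implements a modified version of LRU called the two-list strategy. Instead of maintaining a single LRU list, Linux keeps two lists: the active list and the inactive list. A page is placed on the active list only if it is accessed while it already resides on the inactive list. The lists are kept in balance: if the active list grows much larger than the inactive list, items from the head of the active list are moved back to the inactive list, making them available for eviction.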
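The promotion and balancing rules of the two-list strategy can be illustrated with a small user-space simulation. Everything here is a hypothetical sketch: the names `touch_page()`, `rebalance()`, and `evict_one()` are invented, list order is modelled with a timestamp rather than real linked lists, and the balance rule (active may exceed inactive by at most two pages) is an arbitrary assumption chosen only to make the behaviour visible.

```c
#include <stddef.h>

#define MAX_PAGES 8

enum lru_list { NOT_CACHED, INACTIVE, ACTIVE };

struct lru_page {
	enum lru_list list;
	unsigned long stamp;   /* when the page joined or was last touched on its list */
};

static struct lru_page pages[MAX_PAGES];
static unsigned long clock_tick;

static size_t count_on(enum lru_list list)
{
	size_t n = 0;

	for (int i = 0; i < MAX_PAGES; i++)
		if (pages[i].list == list)
			n++;
	return n;
}

/* Index of the oldest page on a list, or -1 if the list is empty. */
static int oldest_on(enum lru_list list)
{
	int oldest = -1;

	for (int i = 0; i < MAX_PAGES; i++) {
		if (pages[i].list != list)
			continue;
		if (oldest < 0 || pages[i].stamp < pages[oldest].stamp)
			oldest = i;
	}
	return oldest;
}

static void move_to(int page, enum lru_list list)
{
	pages[page].list = list;
	pages[page].stamp = ++clock_tick;
}

/* Assumed balance rule: demote the oldest active pages while the
 * active list is more than two pages larger than the inactive list. */
static void rebalance(void)
{
	while (count_on(ACTIVE) > count_on(INACTIVE) + 2)
		move_to(oldest_on(ACTIVE), INACTIVE);
}

/* First access caches the page as inactive; an access while inactive
 * promotes it to the active list. */
static void touch_page(int page)
{
	switch (pages[page].list) {
	case NOT_CACHED:
		move_to(page, INACTIVE);
		break;
	case INACTIVE:
		move_to(page, ACTIVE);
		rebalance();
		break;
	case ACTIVE:
		pages[page].stamp = ++clock_tick;
		break;
	}
}

/* Evict the oldest inactive page; returns its index, or -1 if none. */
static int evict_one(void)
{
	int victim = oldest_on(INACTIVE);

	if (victim >= 0)
		pages[victim].list = NOT_CACHED;
	return victim;
}
```

A page touched only once stays on the inactive list and is the first candidate for eviction, which is how the scheme avoids the accessed-once failure of plain LRU.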

Linux Page Cache

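The page cache, as its name suggests, is a cache of pages in RAM. The pages originate from reads and writes of regular filesystem files, block device files, and memory-mapped files.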

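The address_space Object

A page in the page cache can consist of multiple noncontiguous physical disk blocks. The Linux page cache aims to cache any page-based object, which includes many forms of files and memory mappings. Like much else in the Linux kernel, address_space is misnamed. A better name might be page_cache_entity or physical_pages_of_a_file. The address_space structure is defined as follows.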

struct address_space {
	struct inode *host; 				/* owning inode */
	struct radix_tree_root page_tree; 	/* radix tree of all pages */
	spinlock_t tree_lock; 				/* page_tree lock */
	unsigned int i_mmap_writable; 		/* VM_SHARED mapping count */
	struct prio_tree_root i_mmap; 		/* list of all mappings */
	struct list_head i_mmap_nonlinear; 	/* VM_NONLINEAR mapping list */
	spinlock_t i_mmap_lock; 			/* i_mmap lock */
	atomic_t truncate_count; 			/* truncate re count */
	unsigned long nrpages; 				/* total number of pages */
	pgoff_t writeback_index; 			/* writeback start offset */
	struct address_space_operations *a_ops; /* operations table */
	unsigned long flags; 				/* gfp_mask and error flags */
	struct backing_dev_info *backing_dev_info; /* read-ahead information */
	spinlock_t private_lock; 			/* private lock */
	struct list_head private_list; 		/* private list */
	struct address_space *assoc_mapping;	/* associated buffers */
};

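The i_mmap field is a priority search tree of all shared and private mappings in this address space. It lets the kernel efficiently find the mappings associated with this cached file.
The address space contains nrpages pages in total.
The host field points to the associated inode. If the associated object has no inode, for example when the address_space belongs to the swapper, host is NULL.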

address_space Operations

The a_ops field points to the address space operations table, which is defined as follows.

#include <linux/fs.h>
struct address_space_operations {
	int (*writepage)(struct page *, struct writeback_control *);
	int (*readpage) (struct file *, struct page *);
	int (*sync_page) (struct page *);
	int (*writepages) (struct address_space *,
			struct writeback_control *);
	int (*set_page_dirty) (struct page *);
	int (*readpages) (struct file *, struct address_space *,
			struct list_head *, unsigned);
	int (*write_begin)(struct file *, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned flags,
			struct page **pagep, void **fsdata);
	int (*write_end)(struct file *, struct address_space *mapping,
			loff_t pos, unsigned len, unsigned copied,
			struct page *page, void *fsdata);
	sector_t (*bmap) (struct address_space *, sector_t);
	int (*invalidatepage) (struct page *, unsigned long);
	int (*releasepage) (struct page *, int);
	int (*direct_IO) (int, struct kiocb *, const struct iovec *,
			loff_t, unsigned long);
	int (*get_xip_mem) (struct address_space *, pgoff_t, int,
			void **, unsigned long *);
	int (*migratepage) (struct address_space *,
			struct page *, struct page *);
	int (*launder_page) (struct page *);
	int (*is_partially_uptodate) (struct page *,
			read_descriptor_t *,
			unsigned long);
	int (*error_remove_page) (struct address_space *,
			struct page *);
};

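Each backing store describes how it interacts with the page cache through its own address_space_operations. For example, the ext3 filesystem defines its operations in fs/ext3/inode.c.
Reading: first, the kernel tries to find the requested data in the page cache. The find_get_page() function performs this check. It is passed an address_space and a page offset: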

page = find_get_page(mapping, index);

If find_get_page() returns NULL, a new page is allocated and added to the page cache:

struct page *page;
int error;

/* allocate the page ... */
page = page_cache_alloc_cold(mapping);
if (!page)
    return -ENOMEM;             /* error allocating memory (illustrative handling) */

/* ... and then add it to the page cache */
error = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
if (error) {
    page_cache_release(page);   /* error adding page to page cache (illustrative handling) */
    return error;
}

/*
 * Finally, the requested data can be read from disk, added to the
 * page cache, and returned to the user:
 */
error = mapping->a_ops->readpage(file, page);

Writing: writes are handled differently from reads. When a page is modified, it is simply marked dirty:

SetPageDirty(page);

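The kernel later writes the page out with the writepage() method. The generic write path in mm/filemap.c performs the following steps (the excerpt uses the older prepare_write() and commit_write() methods, which write_begin() and write_end() replaced in later kernels):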

page = __grab_cache_page(mapping, index, &cached_page, &lru_pvec);
status = a_ops->prepare_write(file, page, offset, offset+bytes);
page_fault = filemap_copy_from_user(page, offset, buf, bytes);
status = a_ops->commit_write(file, page, offset, offset+bytes);

First, the page cache is searched for the desired page. If it is not in the cache, an entry is allocated and added. Next, the kernel sets up the write request and the data is copied from user space into a kernel buffer. Finally, the data is written to disk.

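Because the preceding steps are performed during all page I/O operations, all page I/O is guaranteed to go through the page cache. For writes, the page cache acts as a staging ground.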

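Radix Tree

Because the kernel must check whether a page exists in the page cache before initiating any page I/O, this check must be quick. Otherwise, the overhead of searching would be very costly. As the previous section showed, the page cache is searched by an address_space object plus an offset value. Each address_space has a unique radix tree stored as page_tree.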

A radix tree is a type of search tree in which each level is selected by a group of bits from the key, here the page's offset within the file. The radix tree enables quick searching for the desired page, given only the file offset. Page cache searching functions such as find_get_page() call radix_tree_lookup(), which performs a search on the given tree for the given object.
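To make the lookup by file offset concrete, here is a minimal user-space sketch of a radix tree. It is not the kernel's implementation (which lives in lib/radix-tree.c and adjusts its height dynamically): the names `toy_node`, `toy_insert()`, and `toy_lookup()` are hypothetical, and the fixed height of 3 with 6 bits consumed per level is an assumption made only to keep the example short.

```c
#include <stdlib.h>

#define MAP_SHIFT 6                      /* bits of the index used per level */
#define MAP_SIZE  (1UL << MAP_SHIFT)     /* 64 slots per node */
#define MAP_MASK  (MAP_SIZE - 1)
#define HEIGHT    3                      /* fixed height: indexes below 2^18 */

struct toy_node {
	void *slots[MAP_SIZE];   /* child nodes, or items at the last level */
};

/* Slot number used at a given level (0 is the root) for this index. */
static unsigned long slot_of(unsigned long index, int level)
{
	int shift = (HEIGHT - 1 - level) * MAP_SHIFT;

	return (index >> shift) & MAP_MASK;
}

/* Store item at index; returns 0 on success, -1 on error. */
static int toy_insert(struct toy_node *root, unsigned long index, void *item)
{
	struct toy_node *node = root;

	if (index >> (HEIGHT * MAP_SHIFT))
		return -1;               /* index too large for this toy tree */

	for (int level = 0; level < HEIGHT - 1; level++) {
		void **slot = &node->slots[slot_of(index, level)];

		if (!*slot) {            /* allocate interior nodes on demand */
			*slot = calloc(1, sizeof(struct toy_node));
			if (!*slot)
				return -1;
		}
		node = *slot;
	}
	node->slots[slot_of(index, HEIGHT - 1)] = item;
	return 0;
}

/* Return the item stored at index, or NULL if there is none. */
static void *toy_lookup(const struct toy_node *root, unsigned long index)
{
	const struct toy_node *node = root;

	if (index >> (HEIGHT * MAP_SHIFT))
		return NULL;

	for (int level = 0; level < HEIGHT - 1; level++) {
		node = node->slots[slot_of(index, level)];
		if (!node)
			return NULL;     /* no subtree covers this index */
	}
	return node->slots[slot_of(index, HEIGHT - 1)];
}
```

A lookup touches at most `HEIGHT` nodes regardless of how many pages are stored, which is the property that makes this kind of tree attractive for a check that precedes every page I/O.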

The Old Page Hash Table

// TODO

Buffer Cache

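Individual disk blocks also tie into the page cache by way of block I/O buffers. A buffer is the in-memory representation of a single physical disk block. The page cache also reduces disk access during block I/O operations by caching disk blocks and buffering block I/O operations. This caching is called the buffer cache, and it is part of the page cache.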

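The Flusher Threads

Write operations are deferred in the page cache. When data in the page cache is newer than the data on the backing store, that data is called dirty. Dirty page writeback occurs in the following situations: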

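When free memory shrinks below a specified threshold, the kernel writes dirty data back to disk to release memory, because only clean (non-dirty) memory is available for eviction. Once the data is clean, the kernel can evict it from the cache and then shrink the cache, freeing up more memory.
When dirty data grows older than a specific threshold, sufficiently old data is written back to disk to ensure that dirty data does not remain dirty indefinitely.
When a user process invokes the sync() and fsync() system calls, the kernel performs writeback on demand.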
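The third case, on-demand writeback, is the one user-space code can trigger directly. The following is a minimal sketch using the POSIX calls open(), write(), fsync(), and close(); the helper name `write_durably()` and the file path in the usage note are hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write buf to path and force it to stable storage.
 * Returns 0 on success, -1 on error. */
static int write_durably(const char *path, const char *buf)
{
	size_t len = strlen(buf);
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	/* write() normally returns once the data sits in the page cache. */
	for (size_t off = 0; off < len; ) {
		ssize_t n = write(fd, buf + off, len - off);

		if (n <= 0) {
			close(fd);
			return -1;
		}
		off += (size_t)n;
	}

	/* fsync() blocks until the file's dirty pages have been written back. */
	if (fsync(fd) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

A call such as `write_durably("/tmp/example.txt", "hello\n")` returns only after the data has been handed to the storage device, whereas without the fsync() step the data could remain dirty in the page cache until a flusher thread writes it out.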

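When free memory falls below the dirty_background_ratio threshold, the kernel calls wakeup_flusher_threads() to wake one or more flusher threads and have them run the bdi_writeback_all() function, which writes back dirty pages until both of the following conditions are met: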

The specified minimum number of pages has been written out.
The amount of free memory is above the dirty_background_ratio threshold.
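The two stopping conditions can be expressed as a loop. This is a simplified model, not the kernel's bdi_writeback_all(): the `wb_state` structure and `background_writeback()` are hypothetical names, the threshold is given as a page count rather than a ratio, and the sketch assumes that every page written back immediately yields one more free page.

```c
/* Hypothetical snapshot of the memory state, counted in pages. */
struct wb_state {
	long free_pages;
	long dirty_pages;
};

/* Write back pages until at least min_pages have been written AND
 * free memory is above free_threshold, or until nothing is dirty.
 * Returns the number of pages written. */
static long background_writeback(struct wb_state *s, long min_pages,
				 long free_threshold)
{
	long written = 0;

	while (s->dirty_pages > 0 &&
	       (written < min_pages || s->free_pages <= free_threshold)) {
		s->dirty_pages--;   /* page is written to disk and becomes clean */
		s->free_pages++;    /* assumed: the clean page is then reclaimed */
		written++;
	}
	return written;
}
```

The loop keeps going while either condition is still unmet, so it stops only when both hold, and it also stops early when no dirty pages remain.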

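Periodic writeback is important because it ensures that no dirty page remains in memory indefinitely. At system boot, a timer is initialized to wake a flusher thread and have it run the wb_writeback() function. This function then writes back all data that was modified more than dirty_expire_interval milliseconds ago.
System administrators can set these values in /proc/sys/vm or through sysctl.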
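As an illustration of inspecting these settings from a program, the sketch below reads one integer value from /proc/sys/vm. The helper `read_vm_tunable()` is a hypothetical name. The file names mentioned in the usage note are an assumption based on current kernels, where the two interval settings are exposed as dirty_expire_centisecs and dirty_writeback_centisecs rather than under the variable names used in the text.

```c
#include <stdio.h>

/* Read one integer tunable from /proc/sys/vm.
 * Returns the value, or -1 if the file is missing or unreadable. */
static long read_vm_tunable(const char *name)
{
	char path[128];
	long value = -1;
	FILE *f;
	int n = snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);

	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;               /* name too long for the buffer */

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;              /* file did not contain a number */
	fclose(f);
	return value;
}
```

For example, `read_vm_tunable("dirty_background_ratio")` or `read_vm_tunable("laptop_mode")` returns the current setting on a system that provides the file, and -1 elsewhere. Changing a value requires writing to the same file with sufficient privileges, which this sketch deliberately leaves out.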

The flusher code lives in mm/page-writeback.c and mm/backing-dev.c and the writeback mechanism lives in fs/fs-writeback.c.

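Laptop Mode

Laptop mode is a special page writeback strategy intended to optimize battery life by minimizing hard disk activity and letting hard drives remain spun down as long as possible. It is configured through /proc/sys/vm/laptop_mode. In addition to performing writeback when dirty pages grow too old, the flusher threads also piggyback on any other physical disk I/O, flushing all dirty buffers to disk. This mode makes the most sense when dirty_expire_interval and dirty_writeback_interval are set to large values, say, 10 minutes. With writeback delayed this long, the disk is spun up infrequently, and when it does spin up, laptop mode ensures that the opportunity is fully used.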
