You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Filesystems that are not disk-based (a virtual memory–based filesystem, such as sysfs, for example) generate the superblock on-the-fly and store it in memory
#include<linux/fs.h>structsuper_block {
structlist_heads_list; /* list of all superblocks */dev_ts_dev; /* identifier */unsigned longs_blocksize; /* block size in bytes */unsigned chars_blocksize_bits; /* block size in bits */unsigned chars_dirt; /* dirty flag */unsigned long longs_maxbytes; /* max file size */structfile_system_types_type; /* filesystem type */structsuper_operationss_op; /* superblock methods */structdquot_operations*dq_op; /* quota methods */structquotactl_ops*s_qcop; /* quota control methods */structexport_operations*s_export_op; /* export methods */unsigned longs_flags; /* mount flags */unsigned longs_magic; /* filesystem’s magic number */structdentry*s_root; /* directory mount point */structrw_semaphores_umount; /* unmount semaphore */structsemaphores_lock; /* superblock semaphore */ints_count; /* superblock ref count */ints_need_sync; /* not-yet-synced flag */atomic_ts_active; /* active reference count */void*s_security; /* security module */structxattr_handler**s_xattr; /* extended attribute handlers */structlist_heads_inodes; /* list of inodes */structlist_heads_dirty; /* list of dirty inodes */structlist_heads_io; /* list of writebacks */structlist_heads_more_io; /* list of more writeback */structhlist_heads_anon; /* anonymous dentries */structlist_heads_files; /* list of assigned files */structlist_heads_dentry_lru; /* list of unused dentries */ints_nr_dentry_unused; /* number of dentries on list */structblock_device*s_bdev; /* associated block device */structmtd_info*s_mtd; /* memory disk information */structlist_heads_instances; /* instances of this fs */structquota_infos_dquot; /* quota-specific options */ints_frozen; /* frozen status */wait_queue_head_ts_wait_unfrozen; /* wait queue on freeze */chars_id[32]; /* text name */void*s_fs_info; /* filesystem-specific info */fmode_ts_mode; /* mount permissions */structsemaphores_vfs_rename_sem; /* rename semaphore */u32s_time_gran; /* granularity of timestamps */char*s_subtype; /* subtype name */char*s_options; /* saved mount options */
};
For Unix-style filesystems, this information is simply read from the on-disk inode. If a filesystem does not have inodes, however, the filesystem must obtain the information from wherever it is stored on the disk. Filesystems without inodes generally store file-specific information as part of the file
内核里面inode由struct inode表示
structinode {
structhlist_nodei_hash; /* hash list */structlist_headi_list; /* list of inodes */structlist_headi_sb_list; /* list of superblocks */structlist_headi_dentry; /* list of dentries */unsigned longi_ino; /* inode number */atomic_ti_count; /* reference counter */unsigned inti_nlink; /* number of hard links */uid_ti_uid; /* user id of owner */gid_ti_gid; /* group id of owner */kdev_ti_rdev; /* real device node */u64i_version; /* versioning number */loff_ti_size; /* file size in bytes */seqcount_ti_size_seqcount; /* serializer for i_size */structtimespeci_atime; /* last access time */structtimespeci_mtime; /* last modify time */structtimespeci_ctime; /* last change time */unsigned inti_blkbits; /* block size in bits */blkcnt_ti_blocks; /* file size in blocks */unsigned shorti_bytes; /* bytes consumed */umode_ti_mode; /* access permissions */spinlock_ti_lock; /* spinlock */structrw_semaphorei_alloc_sem; /* nests inside of i_sem */structsemaphorei_sem; /* inode semaphore */structinode_operations*i_op; /* inode ops table */structfile_operations*i_fop; /* default inode ops */structsuper_block*i_sb; /* associated superblock */structfile_lock*i_flock; /* file lock list */structaddress_space*i_mapping; /* associated mapping */structaddress_spacei_data; /* mapping for device */structdquot*i_dquot[MAXQUOTAS]; /* disk quotas for inode */structlist_headi_devices; /* list of block devices */union {
structpipe_inode_info*i_pipe; /* pipe information */structblock_device*i_bdev; /* block device driver */structcdev*i_cdev; /* character device driver */
};// 给定的inode同一时刻只能表示三者之一,所有用unionunsigned longi_dnotify_mask; /* directory notify mask */structdnotify_struct*i_dnotify; /* dnotify */structlist_headinotify_watches; /* inotify watches */structmutexinotify_mutex; /* protects inotify_watches */unsigned longi_state; /* state flags */unsigned longdirtied_when; /* first dirtying time */unsigned inti_flags; /* filesystem flags */atomic_ti_writecount; /* count of writers */void*i_security; /* security module */void*i_private; /* fs private pointer */
};
inode 代表文件系统上的每个文件,但 inode 对象仅在访问文件时在内存中构造.
Inode操作集
inode函数操作集描述了 VFS 可以在 inode 上调用的文件系统实现的功能。
structinode_operations {
int (*create) (structinode*,structdentry*,int, structnameidata*);
structdentry* (*lookup) (structinode*,structdentry*, structnameidata*);
int (*link) (structdentry*,structinode*,structdentry*);
int (*unlink) (structinode*,structdentry*);
int (*symlink) (structinode*,structdentry*,constchar*);
int (*mkdir) (structinode*,structdentry*,int);
int (*rmdir) (structinode*,structdentry*);
int (*mknod) (structinode*,structdentry*,int,dev_t);
int (*rename) (structinode*, structdentry*,
structinode*, structdentry*);
int (*readlink) (structdentry*, char__user*,int);
void* (*follow_link) (structdentry*, structnameidata*);
void (*put_link) (structdentry*, structnameidata*, void*);
void (*truncate) (structinode*);
int (*permission) (structinode*, int);
int (*setattr) (structdentry*, structiattr*);
int (*getattr) (structvfsmount*mnt, structdentry*, structkstat*);
int (*setxattr) (structdentry*, constchar*,constvoid*,size_t,int);
ssize_t (*getxattr) (structdentry*, constchar*, void*, size_t);
ssize_t (*listxattr) (structdentry*, char*, size_t);
int (*removexattr) (structdentry*, constchar*);
void (*truncate_range)(structinode*, loff_t, loff_t);
long (*fallocate)(structinode*inode, intmode, loff_toffset,
loff_tlen);
int (*fiemap)(structinode*, structfiemap_extent_info*, u64start,
u64len);
};
// TODO
### Dentry Object
VFS经常需要执行特定于目录的操作,例如路径名查找,为了实现路径查找,VFS实现了directoryentry(dentry)。dentry是路径中的特定组件。解析路径是非常耗时的操作,`dentry`对象使得这个过程容易很多。
> /bin/vi/ /, bin, andviarealldentryobjects.Thefirsttwoaredirectoriesandthelastisaregularfile. Dentriesmightalsoincludemountpoints. Inthepath /mnt/cdrom/foo, thecomponents /, mnt, cdrom, andfooarealldentryobjects.
dentry对象由`structdentry`表示
```Cstructdentry {
atomic_td_count; /* usage count */unsigned intd_flags; /* dentry flags */spinlock_td_lock; /* per-dentry lock */intd_mounted; /* is this a mount point? */structinode*d_inode; /* associated inode */structhlist_noded_hash; /* list of hash table entries */structdentry*d_parent; /* dentry object of parent */structqstrd_name; /* dentry name */structlist_headd_lru; /* unused list */union {
structlist_headd_child; /* list of dentries within */structrcu_headd_rcu; /* RCU locking */
} d_u;
structlist_headd_subdirs; /* subdirectories */structlist_headd_alias; /* list of alias inodes */unsigned longd_time; /* revalidate time */structdentry_operations*d_op; /* dentry operations table */structsuper_block*d_sb; /* superblock of file */void*d_fsdata; /* filesystem-specific data */unsigned chard_iname[DNAME_INLINE_LEN_MIN]; /* short name */
};
与前两个对象不同,dentry 对象不对应任何类型的磁盘数据结构。
Dentry State
一个有效的dentry object可以是三种状态之一:used, unused, or negative.
A used dentry corresponds to a valid inode (d_inode points to an associated inode) and indicates that there are one or more users of the object (d_count is positive).A used dentry is in use by the VFS and points to valid data and, thus, cannot be discarded.
An unused dentry corresponds to a valid inode (d_inode points to an inode), but the VFS is not currently using the dentry object (d_count is zero). Because the dentry object still points to a valid object, the dentry is kept around—cached—in case it is needed again.
A negative dentry is not associated with a valid inode (d_inode is NULL) because either the inode was deleted or the path name was never correct to begin with
As an example, assume that you are editing a source file in your home directory, /home/dracula/src/the_sun_sucks.c. Each time this file is accessed (for example, when you first open it, later save it, compile it, and so on), the VFS must follow each directory entry to resolve the full path: /, home, dracula, src, and finally the_sun_sucks.c.To avoid this time-consuming operation each time this path name is accessed, the VFS can first try to look up the path name in the dentry cache. If the lookup succeeds, the required final dentry object is obtained without serious effort. Conversely, if the dentry is not in the dentry cache, the VFS must manually resolve the path by walking the filesystem for each component of the path.After this task is completed, the kernel adds the dentry objects to the dcache to speed up any future lookups.
The object points back to the dentry (which in turn points back to the inode) that actually represents the open file.The inode and dentry objects, of course, are unique.
#include<linux/fs.h>structfile {
union {
structlist_headfu_list; /* list of file objects */structrcu_headfu_rcuhead; /* RCU list after freeing */
} f_u;
structpathf_path; /* contains the dentry */structfile_operations*f_op; /* file operations table */spinlock_tf_lock; /* per-file struct lock */atomic_tf_count; /* file object’s usage count */unsigned intf_flags; /* flags specified on open */mode_tf_mode; /* file access mode */loff_tf_pos; /* file offset (file pointer) */structfown_structf_owner; /* owner data for signals */conststructcred*f_cred; /* file credentials */structfile_ra_statef_ra; /* read-ahead state */u64f_version; /* version number */void*f_security; /* security module */void*private_data; /* tty driver hook */structlist_headf_ep_links; /* list of epoll links */spinlock_tf_ep_lock; /* epoll lock */structaddress_space*f_mapping; /* page cache mapping */unsigned longf_mnt_write_state; /* debugging state */
};
与dentry对象类似,file对象在物理磁盘上没有真实对应数据。
The file object does point to its associated dentry object via the f_dentry pointer.The dentry in turn points to the associated inode, which reflects whether the file itself is dirty.
File操作集
file对象的操作集由file_operations表示
#include<linux/fs.h>structfile_operations {
structmodule*owner;
loff_t (*llseek) (structfile*, loff_t, int);
ssize_t (*read) (structfile*, char__user*, size_t, loff_t*);
ssize_t (*write) (structfile*, constchar__user*, size_t, loff_t*);
ssize_t (*aio_read) (structkiocb*, conststructiovec*,
unsigned long, loff_t);
ssize_t (*aio_write) (structkiocb*, conststructiovec*,
unsigned long, loff_t);
int (*readdir) (structfile*, void*, filldir_t);
unsigned int (*poll) (structfile*, structpoll_table_struct*);
int (*ioctl) (structinode*, structfile*, unsigned int,
unsigned long);
long (*unlocked_ioctl) (structfile*, unsigned int, unsigned long);
long (*compat_ioctl) (structfile*, unsigned int, unsigned long);
int (*mmap) (structfile*, structvm_area_struct*);
int (*open) (structinode*, structfile*);
int (*flush) (structfile*, fl_owner_tid);
int (*release) (structinode*, structfile*);
int (*fsync) (structfile*, structdentry*, intdatasync);
int (*aio_fsync) (structkiocb*, intdatasync);
int (*fasync) (int, structfile*, int);
int (*lock) (structfile*, int, structfile_lock*);
ssize_t (*sendpage) (structfile*, structpage*,
int, size_t, loff_t*, int);
unsigned long (*get_unmapped_area) (structfile*,
unsigned long,
unsigned long,
unsigned long,
unsigned long);
int (*check_flags) (int);
int (*flock) (structfile*, int, structfile_lock*);
ssize_t (*splice_write) (structpipe_inode_info*,
structfile*,
loff_t*,
size_t,
unsigned int);
ssize_t (*splice_read) (structfile*,
loff_t*,
structpipe_inode_info*,
size_t,
unsigned int);
int (*setlease) (structfile*, long, structfile_lock**);
};
#include<linux/fs.h>structfile_system_type {
constchar*name; /* filesystem’s name */intfs_flags; /* filesystem type flags *//* the following is used to read the superblock off the disk */structsuper_block*(*get_sb) (structfile_system_type*, int,
char*, void*);
/* the following is used to terminate access to the superblock */void (*kill_sb) (structsuper_block*);
structmodule*owner; /* module owning the filesystem */structfile_system_type*next; /* next file_system_type in list */structlist_headfs_supers; /* list of superblock objects *//* the remaining fields are used for runtime lock validation */structlock_class_keys_lock_key;
structlock_class_keys_umount_key;
structlock_class_keyi_lock_key;
structlock_class_keyi_mutex_key;
structlock_class_keyi_mutex_dir_key;
structlock_class_keyi_alloc_sem_key;
};
#include<linux/mount.h>structvfsmount {
structlist_headmnt_hash; /* hash table list */structvfsmount*mnt_parent; /* parent filesystem */structdentry*mnt_mountpoint; /* dentry of this mount point */structdentry*mnt_root; /* dentry of root of this fs */structsuper_block*mnt_sb; /* superblock of this filesystem */structlist_headmnt_mounts; /* list of children */structlist_headmnt_child; /* list of children */intmnt_flags; /* mount flags */char*mnt_devname; /* device file name */structlist_headmnt_list; /* list of descriptors */structlist_headmnt_expire; /* entry in expiry list */structlist_headmnt_share; /* entry in shared mounts list */structlist_headmnt_slave_list; /* list of slave mounts */structlist_headmnt_slave; /* entry in slave list */structvfsmount*mnt_master; /* slave’s master */structmnt_namespace*mnt_namespace; /* associated namespace */intmnt_id; /* mount identifier */intmnt_group_id; /* peer group identifier */atomic_tmnt_count; /* usage count */intmnt_expiry_mark; /* is marked for expiration */intmnt_pinned; /* pinned count */intmnt_ghosts; /* ghosts count */atomic_t__mnt_writers; /* writers count */
};
structfiles_struct {
atomic_tcount; /* usage count */structfdtable*fdt; /* pointer to other fd table */structfdtablefdtab; /* base fd table */spinlock_tfile_lock; /* per-file lock */intnext_fd; /* cache of next available fd */structembedded_fd_setclose_on_exec_init; /* list of close-on-exec fds */structembedded_fd_setopen_fds_init/* list of open fds */structfile*fd_array[NR_OPEN_DEFAULT]; /* base files array */
};
数组 fd_array 指向打开的文件对象列表。
Because NR_OPEN_DEFAULT
is equal to BITS_PER_LONG, which is 64 on a 64-bit architecture; this includes room for 64 file objects. If a process opens more than 64 file objects, the kernel allocates a new array and points the fdt pointer at it.
另外一个进程相关的是fs_struct结构,它包含了进程文件系统相关信息,结构定义在
<linux/fs_struct.h>structfs_struct {
intusers; /* user count */rwlock_tlock; /* per-structure lock */intumask; /* umask */intin_exec; /* currently executing a file */structpathroot; /* root directory */structpathpwd; /* current working directory */
};
structmnt_namespace {
atomic_tcount; /* usage count */structvfsmount*root; /* root directory */structlist_headlist; /* list of mount points */wait_queue_head_tpoll; /* polling waitqueue */intevent; /* event count */
};
list成员指定构成命名空间的已挂载文件系统的双向链表
For most processes, the process descriptor points to unique files_struct and fs_struct structures. For processes created with the clone flag CLONE_FILES or CLONE_FS, however, these structures are shared.3 Consequently, multiple process descriptors might point to the same files_struct or fs_struct structure.The count member of each structure provides a reference count to prevent destruction while a process is still using the structure
The namespace structure works the other way around. By default, all processes share the same namespace. (That is, they all see the same filesystem hierarchy from the same mount table.) Only when the CLONE_NEWNS flag is specified during clone() is the process given a unique copy of the namespace structure. Because most processes do not provide this flag, all the processes inherit their parents’ namespaces. Consequently, on many systems there is only one namespace, although the functionality is but a single CLONE_NEWNS flag away
The text was updated successfully, but these errors were encountered:
虚拟文件系统(有时称为虚拟文件交换机或更常见的简称为 VFS)是内核的子系统,它实现了提供给用户空间程序的文件和文件系统相关接口
通用文件系统接口
VFS 是一种粘合剂,它使 open()、read() 和 write() 等系统调用能够在不考虑文件系统或底层物理介质的情况下工作。VFS 和块 I/O 层一起提供了抽象、接口和粘合,允许用户空间程序发出通用系统调用以通过任何文件系统上的统一命名策略访问文件,文件系统本身存在于任何存储介质上。
文件系统抽象层
这种适用于任何类型文件系统的通用接口之所以可行,只是因为内核围绕其低级文件系统接口实现了一个抽象层。抽象层通过定义所有文件系统支持的基本概念接口和数据结构来工作。实际上,除了文件系统本身之外,内核中没有任何东西需要了解文件系统的底层细节。比如
流程如下图
Unix文件系统
历史上,Unix提供了4个基本文件系统抽象:
files, directory entries, inodes, and mount points
。在 Unix 中,文件系统挂载在称为命名空间的全局层次结构中的特定挂载点。 这使所有挂载的文件系统都显示为单个树中的条目。文件是有序字节串。文件是有目录管理,目录类似于文件夹,通常包含相关文件,目录可以嵌套以形成路径路径的每个组成部分称为一个
directory entry
VFS对象和他们的数据结构
VFS是面向对象的。VFS有四种主要对象:
每个对象里面包含
operations
对象,这些对象描述了内核操作对于对象的方法:内核中每个已经注册的文件系统由
file_system_type
表示,挂载点由vfsmount
表示,这个结构包含一些挂载点信息,比如挂载位置和标志。每个进程里面由fs_struct
描述对应的文件系统和file
描述对应的文件。Superblock Object
超级块对象由每个文件系统实现,用于存储描述特定文件系统的信息,存储在磁盘上特定的扇区。
对于sysfs这里文件系统,superblock存放在内存中。superblock对象由
super_block
表示创建、管理和销毁supberbloc对象的代码在
fs/super.c
中。超级块对象是通过 alloc_super() 函数创建和初始化的。挂载后,文件系统调用此函数,从磁盘中读取其超级块,并填充其超级块对象
Superblock操作集
Superblock对象里面最重要的就是
s_op
了,指向一个superblock 操作表struct super_operations
,定义如下这个结构中的每一项都是一个指针,指向一个对超级块对象进行操作的函数。下面看一些常用函数
所有这些函数都由 VFS 在进程上下文中调用。如果需要,除了dirty_inode() 之外的所有函数都可以阻塞。上面一些函数可以位NULL,如果关联的指针为 NULL,则 VFS 要么调用通用函数,要么什么都不做,具体取决于操作。
Inode Object
inode 对象表示内核操作文件或目录所需的所有信息。
内核里面inode由
struct inode
表示inode 代表文件系统上的每个文件,但 inode 对象仅在访问文件时在内存中构造.
Inode操作集
inode函数操作集描述了 VFS 可以在 inode 上调用的文件系统实现的功能。
与前两个对象不同,dentry 对象不对应任何类型的磁盘数据结构。
Dentry State
一个有效的dentry object可以是三种状态之一:used, unused, or negative.
在内存紧缺的情况下,
unused
和negative
的内存都可以被回收Dentry Cache
挨个解析路径名称里面的dentry object是非常耗时的,因此内核将dentry objects缓存起来,称为
dcache
。dentry cache包含3个部分:
hash table由
dentry_hashtable
表示,hash值由d_hash()
决定,通过d_lookup()
查找hash表,如果在dcache中找到就返回,否则返回NULL。举例
dcache 还为 inode 缓存提供前端,即
icache
。只要 dentry 被缓存,相应的 inode 也会被缓存。Dentry操作集
Dentry操作集由
dentry_operations
表示。TODO:函数介绍
File对象
我们要查看的最后一个主要 VFS 对象是文件对象,文件对象用来表示进程打开的文件。对象由
open()
创建,由close()
销毁。因为同一个文件可能被多个进程打开,因此可能有多个file对象对应这个文件。可以通过dentry和inode找到file对应的真实文件。内核中file由
struct file
表示与dentry对象类似,file对象在物理磁盘上没有真实对应数据。
File操作集
file对象的操作集由
file_operations
表示//TODO 函数介绍
与文件系统相关的数据结构
除了基本的 VFS 对象外,内核还使用其他标准数据结构来管理与文件系统相关的数据。第一个对象用于描述文件系统的特定变体,例如 ext3、ext4 或 UDF。第二个数据结构描述文件系统的挂载实例。内核中由
file_system_type
表示文件系统。get_sb()
函数从磁盘读取超级块并在加载文件系统时填充超级块对象。无论当前系统中这个文件有多少挂载/未挂载的文件系统实例,这个文件系统都只有一个file_system_type
。挂载点由vfsmount_structure
表示维护所有挂载点列表的复杂部分是文件系统和所有其他挂载点之间的关系,vfsmount 中的各种链表会跟踪这些信息。vfsmount 结构还存储在 mnt_flags 字段中挂载时指定的标志(如果有)。定义在
<linux/mount.h>
和进程有关的数据结构
系统上的每个进程都有自己的打开文件列表、根文件系统、当前工作目录、挂载点等。三个数据结构将 VFS 层和系统上的进程联系在一起:
files_struct
、fs_struct
和namespace
。进程打开的文件和文件描述符都在
file_struct
中,由进程的file
域表示数组 fd_array 指向打开的文件对象列表。
另外一个进程相关的是
fs_struct
结构,它包含了进程文件系统相关信息,结构定义在root
根目录,pwd
当前目录。最后一个是
namespace
结构,定义在<linux/mnt_namespace.h>
,由进程的mnt_namespace
域指向。这使每个进程都能对系统上已挂载的文件系统有一个独特的视角。list成员指定构成命名空间的已挂载文件系统的双向链表
The text was updated successfully, but these errors were encountered: