The Linux Virtual File System, Part 2: Data Structures

奔跑的小刺猬 · Published 2018-10-01 23:56:29

1. Concepts

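At its core, a file system is a specialized hierarchical structure for storing data; it contains files, directories, and the associated control information. Linux describes this structure with a few basic concepts:

File: an ordered collection of information items that forms a logically complete unit. In Linux, directories, devices, sockets, and so on are also treated as files alongside regular files. In short, "everything is a file."

Directory: like a folder, it holds related files. Because a directory can contain subdirectories, directories nest level by level and form file paths. Linux treats a directory as a special kind of file, so the operations used on files can be used on directories as well.

Dentry (directory entry): each component of a file path. In the path /home/source/helloworld.c, the directories /, home, and source and the file helloworld.c are each a dentry.

Inode (index node): a data structure that stores a file's metadata. The metadata is information about the file and is distinct from the file's contents: size, owner, timestamps, location on disk, and so on.

Superblock: a data structure that stores the control information of a file system: its state, type, size, block count, inode count, and so on. It is stored at a specific location on disk.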

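Three file-system terms that are easy to confuse:

Creating: formatting a disk in some way is the process of building a file system on it. Creating a file system writes that file system's control information to specific locations on the disk (what we usually call formatting the disk as a given file system).

Registering: reporting to the kernel, declaring that this file system type is one the kernel supports. File systems compiled into the kernel register during kernel initialization; those built as modules register when the module is loaded. Registration hands the kernel a struct file_system_type instance that describes the file system.

Mounting: the familiar mount operation, which attaches a file system to the directory tree of the Linux root file system. Only then can the file system be accessed.

Suppose a disk is divided into several partitions, each holding a different file system.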

[Figure: a disk and its file systems]

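2. VFS data structures

The VFS relies on four primary data structures, plus some auxiliary ones, to describe its structure. These data structures behave like objects: each primary object contains an operations object made up of a table of function pointers, which describes the operations the kernel can perform on that object.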

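2.1 Boot block

The first block of a disk partition. It records some information about the partition and holds the program and data used to boot from this partition. It is also called the bootstrap block. It should not be confused with the MBR (master boot record), which is the first sector of the whole disk rather than of a partition.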

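2.2 Superblock

The superblock stores the control information of a mounted file system and represents that mounted file system. Each time a real file system is mounted, the kernel reads control information from a specific location on disk to fill in the in-memory superblock object. One mounted file system instance corresponds to one superblock object. The superblock records the file system type it belongs to in its s_type field.

It mainly records the total, used, and free counts of blocks and inodes, the time the file system was mounted, the time of the last write, and so on. Without the superblock there is, in effect, no file system.

An inode records a file's attributes, such as permissions, owner and group, size, and modification time. Each file occupies one inode. To read a file, the system first finds its inode and checks the permissions recorded there against the user; only if they match does it go on to read the contents of the data blocks.

The following details describe the ext2 on-disk layout. The superblock sits 1 KB from the start of the file system and is 1 KB long. For robustness, every block group originally held a copy of the superblock and of the group descriptor table (GDT). On a large file system this wastes many blocks (the GDT copies in particular), so a sparse scheme was introduced later: copies are kept only in block groups 0 and 1 and in groups whose number is a power of 3, 5, or 7 (0, 1, 3, 5, 7, 9, 25, 27, 49, ...). Normally only the primary copy in block group 0 is used; the other copies are used only when the primary is damaged.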

The listing of struct super_block follows (the exact fields vary between kernel versions):


struct super_block {
	struct list_head	s_list;		/* Keep this first; links into the list of all superblocks */
	dev_t			s_dev;		/* search index; _not_ kdev_t */
	unsigned char		s_blocksize_bits;
	unsigned long		s_blocksize;
	loff_t			s_maxbytes;	/* Max file size */
	struct file_system_type	*s_type;                /* file system type */
	const struct super_operations	*s_op;            /* superblock methods */
	const struct dquot_operations	*dq_op;
	const struct quotactl_ops	*s_qcop;
	const struct export_operations *s_export_op;
	unsigned long		s_flags;
	unsigned long		s_iflags;	/* internal SB_I_* flags */
	unsigned long		s_magic;
	struct dentry		*s_root;
	struct rw_semaphore	s_umount;
	int			s_count;
	atomic_t		s_active;
#ifdef CONFIG_SECURITY
	void                    *s_security;
#endif
	const struct xattr_handler **s_xattr;

	struct list_head	s_inodes;	/* all inodes */
	struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
	struct list_head	s_mounts;	/* list of mounts; _not_ for fs use */
	struct block_device	*s_bdev;
	struct backing_dev_info *s_bdi;
	struct mtd_info		*s_mtd;
	struct hlist_node	s_instances;
	struct quota_info	s_dquot;	/* Diskquota specific options */

	struct sb_writers	s_writers;

	char s_id[32];				/* Informational name */
	u8 s_uuid[16];				/* UUID */

	void 			*s_fs_info;	/* Filesystem private info */
	unsigned int		s_max_links;
	fmode_t			s_mode;

	/* Granularity of c/m/atime in ns.
	   Cannot be worse than a second */
	u32		   s_time_gran;

	/*
	 * The next field is for VFS *only*. No filesystems have any business
	 * even looking at it. You had been warned.
	 */
	struct mutex s_vfs_rename_mutex;	/* Kludge */

	/*
	 * Filesystem subtype.  If non-empty the filesystem type field
	 * in /proc/mounts will be "type.subtype"
	 */
	char *s_subtype;

	/*
	 * Saved mount options for lazy filesystems using
	 * generic_show_options()
	 */
	char __rcu *s_options;
	const struct dentry_operations *s_d_op; /* default d_op for dentries */

	/*
	 * Saved pool identifier for cleancache (-1 means none)
	 */
	int cleancache_poolid;

	struct shrinker s_shrink;	/* per-sb shrinker handle */

	/* Number of inodes with nlink == 0 but still referenced */
	atomic_long_t s_remove_count;

	/* Being remounted read-only */
	int s_readonly_remount;

	/* AIO completions deferred from interrupt context */
	struct workqueue_struct *s_dio_done_wq;

	/*
	 * Keep the lru lists last in the structure so they always sit on their
	 * own individual cachelines.
	 */
	struct list_lru		s_dentry_lru ____cacheline_aligned_in_smp;
	struct list_lru		s_inode_lru ____cacheline_aligned_in_smp;
	struct rcu_head		rcu;

	/*
	 * Indicates how deep in a filesystem stack this SB is
	 */
	int s_stack_depth;
};

This structure is huge, since it gathers all the important information about a file system. We only need to look at the more important fields.

struct list_head	s_list;		/* Keep this first */

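s_list: the first member, a node in a circular doubly linked list that connects every super_block. Each super_block represents one file system mounted on the system, so this list holds all the file systems the kernel currently has on record. As the comment says, it must stay the first member of the structure.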

dev_t			s_dev;		/* search index; _not_ kdev_t */

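s_dev: the identifier of the block device that holds this file system. For example, /dev/hda1 has major number 3 and minor number 1, which is 0x301 in the traditional 16-bit encoding.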

unsigned char		s_blocksize_bits;
unsigned long		s_blocksize;
loff_t			s_maxbytes;	/* Max file size */

s_blocksize: the file system's block size, in bytes

s_blocksize_bits: the base-2 logarithm of the block size; for example, 512 bytes gives 9

s_maxbytes: the maximum file size allowed, in bytes

struct file_system_type    *s_type;

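struct file_system_type *s_type: the file system type (is this file system ext2, FAT32, ...?). Note that a "file system" and a "file system type" are different things: one file system type can have many mounted file systems, that is, many super_blocks.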

const struct super_operations	*s_op;
const struct dquot_operations	*dq_op;

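struct super_operations *s_op: points to the set of superblock operation functions provided by the specific file system.

struct dquot_operations *dq_op: points to the set of disk quota operation functions provided by the specific file system.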

const struct quotactl_ops	*s_qcop;

struct quotactl_ops *s_qcop: the methods used to configure disk quotas; they handle requests from user space.

const struct export_operations *s_export_op;

struct export_operations *s_export_op: the export methods (used when the file system is exported, for example over NFS)

unsigned long		s_flags;
unsigned long		s_iflags;	/* internal SB_I_* flags */
unsigned long		s_magic;
struct dentry		*s_root;
struct rw_semaphore	s_umount;
int			s_count;
atomic_t		s_active;

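s_flags: mount flags

s_magic: the magic number that distinguishes this file system type from others

s_root: points to the dentry of this file system's root directory

s_umount: a read-write semaphore that synchronizes mount and unmount operations on the superblock

s_count: the superblock's reference count

s_active: the count of active references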

const struct xattr_handler **s_xattr;

s_xattr: points to an array of handler structures used to process extended attributes

struct list_head	s_inodes;	/* all inodes */

s_inodes: the head of the list that links all inode objects of this file system

struct hlist_bl_head	s_anon;		/* anonymous dentries for (nfs) exporting */
struct block_device	*s_bdev;
struct backing_dev_info *s_bdi;
struct mtd_info		*s_mtd;
struct hlist_node	s_instances;
struct quota_info	s_dquot;	/* Diskquota specific options */
struct sb_writers	s_writers;

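s_anon: anonymous dentries (used for NFS export)

s_bdev: points to the block device the file system is mounted from

s_bdi: backing device information

s_instances: links all superblocks of the same file system type

s_dquot: disk quota options

The remaining fields are explained by their comments.

Superblock operations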


struct super_operations {
	/* Creates and initializes a new inode object under the given superblock. */
	struct inode *(*alloc_inode)(struct super_block *sb);
	/* Frees the given inode. */
	void (*destroy_inode)(struct inode *);
	/* Called by the VFS when an inode is modified (dirtied). */
	void (*dirty_inode) (struct inode *, int flags);
	/* Writes the given inode back to disk. */
	int (*write_inode) (struct inode *, struct writeback_control *wbc);
	/* Called when the last reference to an inode is dropped. */
	int (*drop_inode) (struct inode *);

	void (*evict_inode) (struct inode *);
	/* Releases the superblock. */
	void (*put_super) (struct super_block *);
	/* Synchronizes file system data with the on-disk file system; wait says whether the operation is synchronous. */
	int (*sync_fs)(struct super_block *sb, int wait);
	int (*freeze_fs) (struct super_block *);
	int (*unfreeze_fs) (struct super_block *);
	/* Reports file system statistics by filling in the kstatfs structure. */
	int (*statfs) (struct dentry *, struct kstatfs *);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*umount_begin) (struct super_block *);

	int (*show_options)(struct seq_file *, struct dentry *);
	int (*show_devname)(struct seq_file *, struct dentry *);
	int (*show_path)(struct seq_file *, struct dentry *);
	int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
	ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
	ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
#endif
	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
	long (*nr_cached_objects)(struct super_block *, int);
	long (*free_cached_objects)(struct super_block *, long, int);
};

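2.3 Inode

An inode object stores the information about a file and represents an actual file on the storage device. When a file is first accessed, the kernel assembles the corresponding inode object in memory so that it has everything it needs to operate on the file. What the inode holds is information about the data, known as metadata (a description of the file's attributes): file size, device identifier, user ID, group ID, file mode, extended attributes, access and modification timestamps, link count, pointers to the disk blocks that store the contents, file type, and so on. Part of this information is stored at specific locations on disk; the rest is filled in dynamically when the inode is loaded.

(Note that a file consists of metadata plus the data itself.)

Also note that there are two kinds of inode: the VFS inode and the inode of the specific file system. The former lives in memory, the latter on disk. Each time, the on-disk inode is read in and used to fill the in-memory inode; only then is the on-disk inode actually in use.

How inodes come into being: an on-disk inode is typically 128 or 256 bytes. On file systems such as ext2 the total number of inodes is fixed when the file system is formatted, at a ratio of one inode per fixed number of bytes (the original text cites 2 KB; the actual ratio depends on the formatting options). Since the average file is usually larger than that ratio, the inodes normally do not run out. Some newer file systems allocate inodes dynamically instead, so the count can change with demand.

Inode numbers: an inode number is unique within a file system and identifies a file. Internally, Linux accesses files through inode numbers; file names exist only for the user's convenience. When a file is opened, the system first finds the inode number that corresponds to the file name, then uses the inode number to get the inode, and finally uses the inode to locate the blocks that hold the file's data. At that point the data can be processed.

Inodes and files: an inode is allocated when a file is created. One inode corresponds to exactly one actual file, and one file has exactly one inode (although several names, that is hard links, may refer to it). The maximum number of inodes is therefore the maximum number of files.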


/*
 * Keep mostly read-only and often accessed (especially for
 * the RCU path lookup and 'stat' data) fields at the beginning
 * of the 'struct inode'
 */
struct inode {
	umode_t			i_mode;              /* file type and access permissions */
	unsigned short		i_opflags;
	kuid_t			i_uid;               /* owner's user ID */
	kgid_t			i_gid;               /* owner's group ID */
	unsigned int		i_flags;         /* file system flags */

#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif

	const struct inode_operations	*i_op;        /* inode operations table */
	struct super_block	*i_sb;                    /* associated superblock */
	struct address_space	*i_mapping;           /* associated address mapping */

#ifdef CONFIG_SECURITY
	void			*i_security;
#endif

	/* Stat data, not accessed from path walking */
	unsigned long		i_ino;                    /* inode number */
	/*
	 * Filesystems may only read i_nlink directly.  They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;            /* number of hard links */
	};
	dev_t			i_rdev;                /* real device identifier */
	loff_t			i_size;
	struct timespec		i_atime;            /* last access time */
	struct timespec		i_mtime;            /* last modification time (contents) */
	struct timespec		i_ctime;            /* last change time (inode) */
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	unsigned short          i_bytes;         /* bytes consumed */
	unsigned int		i_blkbits;
	blkcnt_t		i_blocks;                /* number of blocks in the file */

#ifdef __NEED_I_SIZE_ORDERED
	seqcount_t		i_size_seqcount;
#endif

	/* Misc */
	unsigned long		i_state;
	struct mutex		i_mutex;

	unsigned long		dirtied_when;	/* jiffies of first dirtying */

	struct hlist_node	i_hash;      /* hash list */
	struct list_head	i_wb_list;	/* backing dev IO list */
	struct list_head	i_lru;		/* inode LRU list (unused inodes) */
	struct list_head	i_sb_list;  /* links all inodes of one file system */
	union {
		struct hlist_head	i_dentry;  /* list of dentries */
		struct rcu_head		i_rcu;
	};
	u64			i_version;
	atomic_t		i_count;           /* reference count */
	atomic_t		i_dio_count;
	atomic_t		i_writecount;      /* writer count */
#ifdef CONFIG_IMA
	atomic_t		i_readcount; /* struct files open RO */
#endif
	const struct file_operations	*i_fop;	/* former ->i_op->default_file_ops; file operations */
	struct file_lock	*i_flock;            /* list of file locks */
	struct address_space	i_data;            /* pages read and written through this inode */
#ifdef CONFIG_QUOTA
	struct dquot		*i_dquot[MAXQUOTAS];    /* disk quotas for this inode */
#endif
	struct list_head	i_devices;            /* device list (devices that share one driver) */
	union {
		struct pipe_inode_info	*i_pipe;     /* pipe information */
		struct block_device	*i_bdev;         /* block device */
		struct cdev		*i_cdev;             /* character device */
	};

	__u32			i_generation;            /* inode version number */

#ifdef CONFIG_FSNOTIFY
	__u32			i_fsnotify_mask; /* all events this inode cares about */
	struct hlist_head	i_fsnotify_marks;
#endif

	void			*i_private; /* fs or device private pointer */
};

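Note the linked lists used to manage inodes (the field names below follow older 2.6 kernels; in the listing above, i_list has been replaced by i_wb_list and i_lru):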

static struct hlist_head *inode_hashtable __read_mostly;

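i_hash: to speed up inode lookup, every inode has a hash value. This field links the inode into the list of inodes with the same hash value and carries the pointers to its neighbors on that list.

i_list: the doubly linked list formed by inodes; inode objects are chained together through it.

i_dentry: all dentries that refer to this inode form a list, and this field is the head of that list.

1. Within one file system every inode number is unique, and the kernel can find the inode structure from the hash of its inode number.

2. An inode is associated with two device numbers, i_dev and i_rdev (in the listing above only i_rdev remains a field; the device that holds the inode is reached through i_sb->s_dev).

a. Apart from special files, every inode is stored on some device; that device is i_dev.

b. If the inode represents a device rather than a regular file, a second device number is needed; that is i_rdev.

3. About i_state:

Each VFS inode copies some of the data held in the on-disk inode, such as the number of disk blocks the file occupies. If i_state has the I_DIRTY flag set, the inode is "dirty", meaning that the corresponding on-disk inode must be updated.

4. Three important doubly linked lists:

a. The list of unused inodes, the list of in-use inodes, and the list of dirty inodes. Every inode object is always on exactly one of the three.

b. All three lists are linked through the inode's i_list field.

c. Inode objects on the in-use or dirty list are also stored in a hash table.

(The hash table speeds up the search for inode objects.)

5. An inode represents one file in the file system, which can also be a special file such as a device or a pipe. That is why the inode structure has some fields related to special files.

6. Sometimes a file system cannot supply all the information the inode structure asks for. What then?

It can store some other value in those fields.

For example, a file system that does not record access times can store 0 in i_atime.

Inode operations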

struct inode_operations {
	struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
	void * (*follow_link) (struct dentry *, struct nameidata *);
	int (*permission) (struct inode *, int);
	struct posix_acl * (*get_acl)(struct inode *, int);

	int (*readlink) (struct dentry *, char __user *,int);
	void (*put_link) (struct dentry *, struct nameidata *, void *);

	int (*create) (struct inode *,struct dentry *, umode_t, bool);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct inode *,struct dentry *,const char *);
	int (*mkdir) (struct inode *,struct dentry *,umode_t);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t);
	int (*rename) (struct inode *, struct dentry *,
			struct inode *, struct dentry *);
	int (*rename2) (struct inode *, struct dentry *,
			struct inode *, struct dentry *, unsigned int);
	int (*setattr) (struct dentry *, struct iattr *);
	int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
	int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
	ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
	ssize_t (*listxattr) (struct dentry *, char *, size_t);
	int (*removexattr) (struct dentry *, const char *);
	int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
		      u64 len);
	int (*update_time)(struct inode *, struct timespec *, int);
	int (*atomic_open)(struct inode *, struct dentry *,
			   struct file *, unsigned open_flag,
			   umode_t create_mode, int *opened);
	int (*tmpfile) (struct inode *, struct dentry *, umode_t);
	int (*set_acl)(struct inode *, struct posix_acl *, int);
} ____cacheline_aligned;

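A closer look at some of the more important methods:

create(): if the inode describes a directory, then when a file is created or opened in that directory the kernel must create an inode for the file. The VFS does this by calling the directory inode's i_op->create(). The first argument is the directory's inode, the second is the dentry of the new file, and the third is the file's access mode. If the inode describes a regular file, create() is never called on it.
lookup(): looks up the dentry of the named file in a directory.
link(): creates a hard link in the given directory. It is what the link() system call eventually invokes. The first argument is the dentry of the original file, the second is the inode of the target directory, and the third is the dentry of the link.
unlink(): removes the given hard link from a directory. It is what the unlink() system call eventually invokes. The first argument is the inode of the directory that holds the link, and the second is the dentry of the file to remove.
symlink(): creates a new symbolic link in a directory.
mkdir(): creates a subdirectory in the given directory through the directory inode's i_op->mkdir(). It is what the mkdir() system call eventually invokes. The first argument is the inode of the given directory, the second is the dentry of the subdirectory, and the third is the subdirectory's mode.
rmdir(): removes a subdirectory from the directory the inode describes. It is what the rmdir() system call eventually invokes.
mknod(): creates a special file in the given directory, such as a pipe, a device file, or a socket.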

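2.4 Dentry (directory entry)

A "file" is information stored on a medium in some form, so a file really consists of two kinds of information: the stored data itself, and the information about how the file is organized and managed. In memory, every file has a dentry and an inode. The dentry records the file name, the parent directory, and so on, and it is the dentries that form the tree we see. The organization and management information lives mainly in the inode, which records where and how the file is laid out on the storage medium. dentry->d_inode points to the corresponding inode. Dentries and inodes are in a many-to-one relationship, because one file may have several names. All dentries are connected through d_parent and d_child, which gives the familiar tree. Note that both directories and ordinary files are dentries, and together they make up one large directory tree. For example, when opening /home/xxx/yyy.txt, the components /, home, xxx, and yyy.txt are each a dentry. During lookup the VFS walks the dentries level by level, finding each one's inode, until it reaches the final file.

A directory's data blocks store the inode numbers and names of all the files in that directory. The overall structure is a tree: to find a file, the operating system starts at the root directory and resolves each directory in the path in turn until it locates the file.

Note: a directory is also a file (so it has an inode too). Opening a directory really means opening the directory file.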


struct dentry {
	/* RCU lookup touched fields */
	unsigned int d_flags;		/* protected by d_lock */
	seqcount_t d_seq;		/* per dentry seqlock */
	struct hlist_bl_node d_hash;	/* lookup hash list */
	struct dentry *d_parent;	/* parent directory */
	struct qstr d_name;
	struct inode *d_inode;		/* Where the name belongs to - NULL is
					 * negative; the inode associated with this dentry */
	unsigned char d_iname[DNAME_INLINE_LEN];	/* small names */

	/* Ref lookup also touches following */
	struct lockref d_lockref;	/* per-dentry lock and refcount */
	const struct dentry_operations *d_op;    /* dentry operations */
	struct super_block *d_sb;	/* The root of the dentry tree (superblock of the owning file system) */
	unsigned long d_time;		/* used by d_revalidate */
	void *d_fsdata;			/* fs-specific data */

	struct list_head d_lru;		/* LRU list of unused dentries */
	struct list_head d_child;	/* child of parent list; links this dentry into the parent's d_subdirs */
	struct list_head d_subdirs;	/* our children; head of the list of child dentries */
	/*
	 * d_alias and d_rcu can share memory
	 */
	union {
		struct hlist_node d_alias;	/* inode alias list */
	 	struct rcu_head d_rcu;
	} d_u;
};

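A valid (positive) dentry always has an inode, because a dentry stands for either a file or a directory, and a directory is itself a file. So for such a dentry the d_inode pointer always points to an inode; only a negative dentry, which records that a name does not exist, has d_inode set to NULL. An inode, on the other hand, can correspond to several dentries.

Note: the whole structure is a tree. If you have read my post on kobject in the device model, you will recognize the pattern: a directory is a file (kobject, inode) with one more layer of wrapping, which mainly adds two pointers, one to the parent directory and one to the head of the list of everything the directory contains (regular files and directories). This is what makes directory operations possible: going up to the parent directory takes a single pointer step (..), while entering a subdirectory takes several steps to search the list.

Dentry operations (the inode operations already cover mkdir, rmdir, mknod, and the like):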

struct dentry_operations {
	/* Checks whether the dentry is still valid. The VFS calls it when it is about to use a dentry from the dcache. */
	int (*d_revalidate)(struct dentry *, unsigned int);
	int (*d_weak_revalidate)(struct dentry *, unsigned int);
	/* Computes the hash value for the dentry. The VFS calls it when a dentry is added to the hash table. */
	int (*d_hash)(const struct dentry *, struct qstr *);
	/* Compares two file names. Older kernels held dcache_lock around this call. */
	int (*d_compare)(const struct dentry *, const struct dentry *,
			unsigned int, const char *, const struct qstr *);
	/* Called by the VFS when the dentry's reference count drops to 0. Older kernels held dcache_lock around this call. */
	int (*d_delete)(const struct dentry *);
	/* Called by the VFS when the dentry is about to be freed. */
	void (*d_release)(struct dentry *);
	void (*d_prune)(struct dentry *);
	/* Called by the VFS when a dentry loses its inode. */
	void (*d_iput)(struct dentry *, struct inode *);
	char *(*d_dname)(struct dentry *, char *, int);
	struct vfsmount *(*d_automount)(struct path *);
	int (*d_manage)(struct dentry *, bool);
} ____cacheline_aligned;

https://blog.csdn.net/denzilxu/article/details/9188003

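2.5 File object

A file object describes a file that a process has opened. Because several processes can open the same file, one file can have several file objects. The file itself is unique, however, so the inode is unique, and the dentry is fixed as well.

A process works with files through file descriptors. Every open file has a number, the file position, that gives the byte offset of the next read or write (f_pos, of the 64-bit type loff_t). The position normally starts at 0 when a file is opened, apart from a few special cases. Linux keeps the position of an open file in struct file, which is why struct file is called an open file description. This is worth taking the time to understand. The file structures are linked together into what is called the system open file table.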

struct file {
	union {
		struct llist_node	fu_llist;    /* list node; older kernels linked each file system's open files through this slot */
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;                
#define f_dentry	f_path.dentry
	struct inode		*f_inode;	/* cached value */
	const struct file_operations	*f_op;    /* pointer to the file operations table */
	/*
	 * Protects f_ep_links, f_flags.
	 * Must not be taken from IRQ context.
	 */
	spinlock_t		f_lock;
	atomic_long_t		f_count;                /* usage count of the file object */
	unsigned int 		f_flags;                /* flags given when the file was opened */
	fmode_t			f_mode;                     /* access mode of the file */
	struct mutex		f_pos_lock;
	loff_t			f_pos;                      /* current file offset */
	struct fown_struct	f_owner;
	const struct cred	*f_cred;
	struct file_ra_state	f_ra;               /* read-ahead state */

	u64			f_version;                      /* version number */
#ifdef CONFIG_SECURITY
	void			*f_security;                /* security module data */
#endif
	/* needed for tty driver, and maybe others */
	void			*private_data;              /* hook for tty and other drivers */

#ifdef CONFIG_EPOLL
	/* Used by fs/eventpoll.c to link all the hooks to this file */
	struct list_head	f_ep_links;
	struct list_head	f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
	struct address_space	*f_mapping;        /* page cache mapping */
} __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */

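1. A file object has no corresponding on-disk data, so the structure has no flag that says whether the object is dirty and needs writing back. The file object points to its dentry through f_path.dentry. The dentry points to the inode, and the inode records whether the file is dirty.
2. fu_list: in older kernels, the open files of each file system formed a doubly linked list whose head was stored in the superblock's s_files field, and this field held the links to the neighboring file structures on that list. In the listing above the corresponding member is fu_llist.

struct file mainly holds the file position; it also holds a pointer to the file's inode.

Why not store the file position directly in the inode? Because files in Linux can be shared. If the position were stored in the inode, two or more processes that opened the same file would all go through the same inode, and one process's lseek would affect another process's reads, which would be a fatal flaw.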

File operations

struct file_operations {
	struct module *owner;
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
	ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
	ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t);
	ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
	ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);
	int (*iterate) (struct file *, struct dir_context *);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *, fl_owner_t id);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, loff_t, loff_t, int datasync);
	int (*aio_fsync) (struct kiocb *, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int);
	unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
	int (*check_flags)(int);
	int (*flock) (struct file *, int, struct file_lock *);
	ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t, unsigned int);
	ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t, unsigned int);
	int (*setlease)(struct file *, long, struct file_lock **);
	long (*fallocate)(struct file *file, int mode, loff_t offset,
			  loff_t len);
	int (*show_fdinfo)(struct seq_file *m, struct file *f);
};

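Driver developers will find this structure the most familiar of all, and it is one they must master.

owner: the module that owns this file operations structure, usually THIS_MODULE.
llseek: sets the file offset. The first argument is the file to operate on, the second is the offset, and the third is the origin of the offset (one of SEEK_SET, SEEK_CUR, and SEEK_END).
read: reads data from the file. The first argument is the source file, the second is the destination buffer, the third is the number of bytes to read, and the fourth gives the offset in the source file at which to start reading. Invoked by the read() system call.
write: writes data to the file. The first argument is the destination file, the second is the source buffer, the third is the number of bytes to write, and the fourth gives the offset in the destination file at which to start writing. Invoked by the write() system call.
mmap: maps the given file into the given address space. Invoked by the mmap() system call.
open: opens the given file and associates it with the given inode. Invoked by the open() system call.
release: releases an open file; called when the open file's reference count (f_count) drops to 0.
fsync: writes the file's buffered data back to disk.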

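2.6 File system types and mounts

File system types are distinguished by the physical medium the file system lives on and by how the data is organized on that medium. struct file_system_type describes the type information of a specific file system. Every file system supported by Linux has exactly one file_system_type structure, no matter whether zero or many instances of it are mounted.

In contrast, each time a file system is actually mounted, a vfsmount structure is created; this structure corresponds to one mount point.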

struct file_system_type {
	const char *name;                        /* name of the file system type */
	int fs_flags;                            /* file system type flags */
#define FS_REQUIRES_DEV		1 
#define FS_BINARY_MOUNTDATA	2
#define FS_HAS_SUBTYPE		4
#define FS_USERNS_MOUNT		8	/* Can be mounted by userns root */
#define FS_USERNS_DEV_MOUNT	16 /* A userns mount does not imply MNT_NODEV */
#define FS_USERNS_VISIBLE	32	/* FS must already be visible */
#define FS_NOEXEC		64	/* Ignore executables on this fs */
#define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move() during rename() internally. */
	struct dentry *(*mount) (struct file_system_type *, int,
		       const char *, void *);
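	void (*kill_sb) (struct super_block *);        /* shuts down access to the superblock */
	struct module *owner;                          /* module that implements the file system */
	struct file_system_type * next;                /* next file system type in the list */
	struct hlist_head fs_supers;                   /* list of superblocks of this file system type */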

	struct lock_class_key s_lock_key;
	struct lock_class_key s_umount_key;
	struct lock_class_key s_vfs_rename_key;
	struct lock_class_key s_writers_key[SB_FREEZE_LEVELS];

	struct lock_class_key i_lock_key;
	struct lock_class_key i_mutex_key;
	struct lock_class_key i_mutex_dir_key;
};

struct vfsmount {
	struct dentry *mnt_root;	/* root of the mounted tree */
	struct super_block *mnt_sb;	/* pointer to superblock */
	int mnt_flags;                  /* mount flags */
};

struct mount {
	struct hlist_node mnt_hash;                        /* hash list */
	struct mount *mnt_parent;                        /* parent mount */
	struct dentry *mnt_mountpoint;                    /* dentry of the mount point */
	struct vfsmount mnt;
	struct rcu_head mnt_rcu;
#ifdef CONFIG_SMP
	struct mnt_pcp __percpu *mnt_pcp;
#else
	int mnt_count;                    /* usage count */
	int mnt_writers;
#endif
	struct list_head mnt_mounts;	/* list of children, anchored here */
	struct list_head mnt_child;	/* and going through their mnt_child */
	struct list_head mnt_instance;	/* mount instance on sb->s_mounts */
	const char *mnt_devname;	/* Name of device e.g. /dev/dsk/hda1 */
	struct list_head mnt_list;        /* list of mount descriptors */
	struct list_head mnt_expire;	/* link in fs-specific expiry list */
	struct list_head mnt_share;	/* circular list of shared mounts */
	struct list_head mnt_slave_list;/* list of slave mounts */
	struct list_head mnt_slave;	/* slave list entry */
	struct mount *mnt_master;	/* slave is on master->mnt_slave_list */
	struct mnt_namespace *mnt_ns;	/* containing namespace */
	struct mountpoint *mnt_mp;	/* where is it mounted */
#ifdef CONFIG_FSNOTIFY
	struct hlist_head mnt_fsnotify_marks;
	__u32 mnt_fsnotify_mask;
#endif
	int mnt_id;			/* mount identifier */
	int mnt_group_id;		/* peer group identifier */
	int mnt_expiry_mark;		/* true if marked for expiry */
	int mnt_pinned;
	struct path mnt_ex_mountpoint;        /* external mount point */
};

2.7 Process-related structures

struct task_struct {
    ......
/* CPU-specific state of this task */
	struct thread_struct thread;        /* CPU-specific thread state */
/* filesystem information */
	struct fs_struct *fs;               /* ties the process to the file system */
/* open file information */
	struct files_struct *files;         /* the set of open files */
    ......
};

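The task structure above is enormous and covers every aspect of a process; here we only look at the two members related to files.

/* ties the process to the file system */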
struct fs_struct {
	int users;                 
	spinlock_t lock;            /* lock protecting this structure */
	seqcount_t seq;
	int umask;                /* default file creation mask */
	int in_exec;
	struct path root, pwd;     /* root directory and current working directory */
};

struct path {
	struct vfsmount *mnt;
	struct dentry *dentry;
};

/* the set of files a process has open */
struct files_struct {
  /*
   * read mostly part
   */
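	atomic_t count;                           /* usage count: how many tasks share this structure */
	struct fdtable __rcu *fdt;
	struct fdtable fdtab;                     /* used by default; describes the arrays below. Once more than NR_OPEN_DEFAULT files are open, space is allocated dynamically and fdt above points to it */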
  /*
   * written part on a separate cache line in SMP
   */
	spinlock_t file_lock ____cacheline_aligned_in_smp;
	int next_fd;                              /* next file descriptor to try, to speed up allocation */
	unsigned long close_on_exec_init[1];      /* descriptors to close on exec() */
	unsigned long open_fds_init[1];           /* initial set of open file descriptors */
	struct file __rcu * fd_array[NR_OPEN_DEFAULT];        /* default array of file objects */
};


struct fdtable {
	unsigned int max_fds;        /* how many files the current fd array can hold */
	struct file __rcu **fd;      /* current fd array; by default files_struct's fd_array, otherwise a dynamically allocated array */
	unsigned long *close_on_exec;
	unsigned long *open_fds;     /* bitmap of the descriptors the process has open */
	struct rcu_head rcu;         /* used to free a replaced table after an RCU grace period */
};

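By default fd points to the fd_array array. NR_OPEN_DEFAULT equals BITS_PER_LONG, so the array holds 32 file objects on a 32-bit system (64 on a 64-bit system). If a process opens more file objects than that, the kernel allocates a new array and points fd at it.

fd_array holds the addresses of the struct file of every file the process has open. By convention descriptors 0, 1, and 2 are standard input, standard output, and standard error.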

2.8 Path lookup

struct nameidata {
	struct path	path;            /* path of the current dentry */
	struct qstr	last;            /* last component of the path */
	struct path	root;            /* root path used for this lookup */
	struct inode	*inode; /* path.dentry.d_inode */
	unsigned int	flags;       /* lookup flags */
	unsigned	seq, m_seq;
	int		last_type;           /* type of the last component */
	unsigned	depth;           /* current nesting depth of symbolic links, bounded by MAX_NESTED_LINKS */
	char *saved_names[MAX_NESTED_LINKS + 1];    /* pathnames saved for nested symbolic links */
};

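User programs normally open a file by its pathname. The kernel therefore has to resolve the pathname to the inode in the file system, and then reach the corresponding file_operations through the inode.

2.9 How the objects relate

The data structures above do not exist in isolation; the VFS works because of the way they are connected. The figures below describe those connections.

As the figure below shows, every file system supported by Linux has exactly one file_system_type structure, no matter whether zero or many instances of it are mounted. Each mounted file system has a superblock and a mount point. The superblock points to its file system type through its s_type field. The file system type links the superblocks of that type through its fs_supers field, and superblocks of the same type are chained to one another through their s_instances fields.

[Figure: relationship among superblocks, mount points, and file system types]

[Figure: relationship among a process and its superblocks, files, inodes, and dentries]

During open(), the kernel resolves the pathname (the path plus the name, that is, the dentries) to an inode and from there to its file_operations. It sets up a struct file that refers to those operations, stores a pointer to it in the lowest unused slot of the fd_array in the process's files_struct, and returns the index of that slot, which is what applications call the file descriptor.

Later calls such as read and write use the file descriptor to go straight to the corresponding array slot in files_struct and invoke the driver functions from there.

file, dentry, inode, super_block, and the conventions for where the superblock sits all belong to the VFS layer. The i_fop in the inode and the f_op in the file refer to the same operations. Although every file has a directory entry and an inode on disk, the in-memory dentry and inode structures are created only when needed. Special files also have inode and dentry structures in memory, but they do not necessarily have an inode and a directory entry on the storage medium. Some special files have nothing to do with external devices and involve only memory and the CPU; /dev/null is one such file, and anything written to it is discarded. The inode is what all the different kinds of file have in common.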

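3. A worked example

Files and I/O: every process keeps a file descriptor table in its PCB (process control block). A file descriptor is an index into this table, and each entry holds a pointer to an open file. To be precise: an open file is represented in the kernel by a file structure, and the pointers in the file descriptor table point to file structures.

The file structure maintains the file status flags (the f_flags member) and the current read/write position (the f_pos member). In the figure above, process 1 and process 2 have both opened the same file but through different file structures, so they can have different file status flags and positions. Another important member is f_count, the reference count. System calls such as dup and fork make several file descriptors point to the same file structure. If, say, fd1 and fd2 both refer to one file structure, its reference count is 2. close(fd1) does not free the file structure; it only drops the count to 1. A further close(fd2) drops the count to 0 and frees the file structure, and only then is the file really closed.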

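Every file structure points to a file_operations structure whose members are all function pointers to the kernel functions that implement the various file operations. When a user program calls read on a file descriptor, read enters the kernel through a system call, finds the file structure the descriptor points to, finds the file_operations structure that file points to, and calls the kernel function its read member points to. Calls such as lseek, read, write, ioctl, and open from user programs are all ultimately carried out by the kernel functions that the members of file_operations point to. The release member serves the user program's close request. It is called release rather than close because close does not necessarily close the file: it decrements the reference count, and the file is closed only when the count reaches 0. For regular files opened on the same file system, operations such as read and write follow the same steps and call the same functions, so the three open files in the figure have file structures that point to the same file_operations structure. For a character device file, read and write necessarily differ from those of a regular file: they access a hardware device rather than disk blocks. Its file structure therefore points to a different file_operations structure, whose functions are implemented by the device's driver.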

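Every file structure has a pointer to a dentry structure; "dentry" is short for directory entry. The argument passed to functions such as open and stat is a path, for example /home/akaedu/a, and the file's inode has to be found from that path. To reduce disk reads, the kernel caches the directory tree in the dentry cache, each node of which is a dentry structure. Lookup only needs to follow the dentries of the path components: from the root directory / to the home directory, then to the akaedu directory, then to the file a. The dentry cache keeps only recently accessed entries; if the entry being looked up is not in the cache, it has to be read from disk into memory.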

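Every dentry structure has a pointer to an inode structure, which holds the information read from the on-disk inode, such as owner, file size, file type, and permission bits. In the example in the figure above there are two dentries, for /home/akaedu/a and /home/akaedu/b, and both point to the same inode, which means the two files are hard links to each other. Every inode structure has a pointer to an inode_operations structure, which is again a set of function pointers to kernel functions that carry out file and directory operations. Unlike file_operations, the functions in inode_operations do not operate on one particular file; they affect the layout of files and directories, for example creating and removing files and directories or following symbolic links. The inode structures that belong to one file system can point to the same inode_operations structure.

The inode structure has a pointer to a super_block structure, which holds the information read from the superblock of the disk partition, such as the file system type and the block size. The s_root member of super_block is a pointer to a dentry, the dentry of the root directory of this file system; in the example in the figure above, the partition is mounted at /home.

The file, dentry, inode, and super_block structures make up the core concepts of the VFS. The ext2 file system has inodes and a superblock in its on-disk layout too, so its concepts map easily onto those of the VFS. Some other file system formats come from non-UNIX systems (for example the Windows FAT32 and NTFS) and may have no notion of inodes or a superblock. To be mountable on Linux, their drivers have to make these up. On Linux the permission bits of files on FAT32 and NTFS partitions look wrong, with every file showing rwxrwxrwx, because these file systems have no inode or permission bits of their own and the values are synthesized.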
