Linux Kernel | Device Mapper Module (2): Data Structures

赵同学的代码时间 · Published 2022-12-04 18:57:00

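Picking up where the previous post left off, this post analyzes Device Mapper's data structures and how they relate to one another. We start with a diagram: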

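The diagram gives a fairly detailed picture of how the dm data structures connect. The structures specific to dm are dm_table, dm_target and dm_dev. Before introducing them, let's take another look at the gendisk structure: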

//common/include/linux/genhd.h:121
struct gendisk {
  /* major, first_minor and minors are input parameters only,
   * don't use directly.  Use disk_devt() and disk_max_parts().
   */
  int major;      /* major number of driver */
  int first_minor;
  int minors;                     /* maximum number of minors, =1 for
                                         * disks that can't be partitioned. */


  char disk_name[DISK_NAME_LEN];  /* name of major driver */


  unsigned short events;    /* supported events */
  unsigned short event_flags;  /* flags related to event processing */


  struct xarray part_tbl;
  struct block_device *part0;


  const struct block_device_operations *fops;
  struct request_queue *queue;
  void *private_data;


  int flags;
  unsigned long state;
#define GD_NEED_PART_SCAN    0
#define GD_READ_ONLY      1
#define GD_DEAD        2


  struct mutex open_mutex;  /* open/close mutex */
  unsigned open_partitions;  /* number of open partitions */


  struct backing_dev_info  *bdi;
  struct kobject *slave_dir;
#ifdef CONFIG_BLOCK_HOLDER_DEPRECATED
  struct list_head slave_bdevs;
#endif
  struct timer_rand_state *random;
  atomic_t sync_io;    /* RAID */
  struct disk_events *ev;
#ifdef  CONFIG_BLK_DEV_INTEGRITY
  struct kobject integrity_kobj;
#endif  /* CONFIG_BLK_DEV_INTEGRITY */
#if IS_ENABLED(CONFIG_CDROM)
  struct cdrom_device_info *cdi;
#endif
  int node_id;
  struct badblocks *bb;
  struct lockdep_map lockdep_map;
  u64 diskseq;
};

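A newly created mapped_device is tied to its gendisk in two ways: the gendisk's private_data field points back to the mapped_device, and the two share the same request_queue.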
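To make the linkage concrete, here is a minimal sketch using cut-down stand-ins for the kernel structures. Only the linking fields are kept, and the helpers link_md_and_disk and disk_to_md are hypothetical names introduced for illustration, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only linking fields are kept. */
struct request_queue { int id; };

struct gendisk {
    struct request_queue *queue;
    void *private_data;          /* owner-specific data; dm stores its mapped_device here */
};

struct mapped_device {
    struct request_queue *queue; /* same queue object as disk->queue */
    struct gendisk *disk;
};

/* Hypothetical helper: wire the two objects together as the text describes. */
static void link_md_and_disk(struct mapped_device *md, struct gendisk *disk,
                             struct request_queue *q)
{
    assert(md && disk && q);
    disk->queue = q;
    disk->private_data = md;
    md->disk = disk;
    md->queue = q;
}

/* Hypothetical helper: recover the mapped_device from a gendisk. */
static struct mapped_device *disk_to_md(struct gendisk *disk)
{
    return disk->private_data;
}
```

With this wiring, code that only holds a gendisk can reach the dm-specific state through private_data, and both objects refer to one shared queue.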

For a dm device, nothing matters more than the mapping table. The dm_table structure looks like this:

//common/drivers/md/dm-core.h:168
struct dm_table {
  struct mapped_device *md; // the mapped device this table belongs to
  enum dm_queue_mode type;  // Type of table, mapped_device's mempool and request_queue

  /* btree table: targets are looked up through a B-tree */
  unsigned int depth; // depth of the B-tree
  unsigned int counts[DM_TABLE_MAX_DEPTH]; /* number of nodes at each level */
  sector_t *index[DM_TABLE_MAX_DEPTH]; // index keys of each level

  unsigned int num_targets; // number of targets in use
  unsigned int num_allocated; // number of slots allocated in the offset and target arrays
  sector_t *highs; // pointer to the offset array
  struct dm_target *targets; // pointer to the target array


  struct target_type *immutable_target_type;


  bool integrity_supported:1;
  bool singleton:1;
  unsigned integrity_added:1;


  /*
   * Indicates the rw permissions for the new logical
   * device.  This should be a combination of FMODE_READ
   * and FMODE_WRITE.
   */
  fmode_t mode;


  /* a list of devices used by this table */
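  struct list_head devices; // head of the doubly linked device list

  /* events get handed up using this callback */
  void (*event_fn)(void *); // event callback of the table
  void *event_context; // argument passed to the event callback

  struct dm_md_mempools *mempools; // memory pools for specific structures, used by the mapped device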


#ifdef CONFIG_BLK_INLINE_ENCRYPTION
  struct blk_keyslot_manager *ksm;
#endif
};

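dm_table defines the relationship between a mapped_device and its dm_targets. The offset array has element type sector_t and, like the target array, holds num_targets entries; each entry records where the corresponding mapping rule ends on the mapped device: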

For example, take the following two mapping rules:

0  2056320  striped 2 32 /dev/hda 0 /dev/hdb
2056320 2875602 linear /dev/hdb 1028169

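The offset array for these rules also has two entries, 2056319 and 4931921 (that is, 2056320 + 2875602 - 1), each giving the last sector covered by its rule. This makes it easy to find the target for any sector of the mapped device: the array is strictly ascending, so a plain linear scan needs about n/2 comparisons on average, which is still O(n).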
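The arithmetic above can be sketched in a few lines of C. This is an illustration only: struct rule, build_highs and find_rule are hypothetical names, and the scan is the plain linear one described in the text, not the kernel's lookup code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

/* One mapping rule: start sector and length, as in "<start> <len> <type> ...". */
struct rule { sector_t start; sector_t len; };

/* Fill highs[i] with the last sector covered by rule i. */
static void build_highs(const struct rule *rules, size_t n, sector_t *highs)
{
    for (size_t i = 0; i < n; i++)
        highs[i] = rules[i].start + rules[i].len - 1;
}

/* Linear scan: index of the first rule whose last sector is >= sector,
 * or n if the sector lies beyond every rule. */
static size_t find_rule(const sector_t *highs, size_t n, sector_t sector)
{
    size_t i = 0;
    while (i < n && highs[i] < sector)
        i++;
    return i;
}
```

For the two rules above, sector 2056319 still belongs to rule 0 and sector 2056320 is the first sector of rule 1.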

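The offset array and the target array are allocated together, with the target array placed immediately after the offset array. highs and targets point to where each begins, num_targets gives the number of entries in use, and num_allocated gives the number of entries allocated. One extra empty entry is appended at the very end to catch out-of-range lookups.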
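A rough sketch of this single-allocation layout follows. It is an assumption-laden simplification: struct target stands in for struct dm_target, alloc_arrays and free_arrays are hypothetical names, and filling unused offsets with all-ones is one way to make the spare entry act as a sentinel.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t sector_t;
struct target { sector_t begin, len; };   /* stand-in for struct dm_target */

struct table {
    unsigned num_targets;    /* slots in use */
    unsigned num_allocated;  /* slots allocated (not counting the spare) */
    sector_t *highs;         /* start of the single allocation */
    struct target *targets;  /* points just past the highs array */
};

/* Allocate both arrays in one block, with one spare slot at the end. */
static int alloc_arrays(struct table *t, unsigned num)
{
    size_t slots = (size_t)num + 1;
    sector_t *highs = calloc(slots, sizeof(sector_t) + sizeof(struct target));
    if (!highs)
        return -1;
    /* All-ones offsets mark unused slots, so a scan stops before running off the end. */
    memset(highs, 0xff, slots * sizeof(sector_t));
    t->highs = highs;
    t->targets = (struct target *)(highs + slots);
    t->num_allocated = num;
    t->num_targets = 0;
    return 0;
}

static void free_arrays(struct table *t)
{
    free(t->highs);          /* one free releases both arrays */
    t->highs = NULL;
    t->targets = NULL;
}
```

Keeping both arrays in one block means a single allocation and a single free, and the offsets that lookups scan stay contiguous in memory.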

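Finding the target for a sector is on the path of every I/O, so for performance a B-tree lookup is used. Once the table has been fully built and all targets have been added, the B-tree is initialized, and from then on lookups go through it.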
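The idea of an index built over the sorted offset array can be illustrated with a deliberately simplified two-level version. This is not the kernel's multi-level layout: KEYS_PER_NODE here is an arbitrary illustrative value, and struct index2, index2_build and index2_find are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t sector_t;

#define KEYS_PER_NODE 4      /* illustrative; not the kernel's value */
#define MAX_LEAVES    16

struct index2 {
    const sector_t *highs;         /* leaf level: the sorted offset array */
    size_t n;                      /* number of entries in highs */
    sector_t node_max[MAX_LEAVES]; /* upper level: largest key of each leaf node */
    size_t leaves;
};

/* Build the upper level once all targets have been added. */
static int index2_build(struct index2 *ix, const sector_t *highs, size_t n)
{
    size_t leaves = (n + KEYS_PER_NODE - 1) / KEYS_PER_NODE;
    if (n == 0 || leaves > MAX_LEAVES)
        return -1;
    ix->highs = highs;
    ix->n = n;
    ix->leaves = leaves;
    for (size_t l = 0; l < leaves; l++) {
        size_t last = (l + 1) * KEYS_PER_NODE - 1;
        if (last >= n)
            last = n - 1;          /* the final leaf may be partly filled */
        ix->node_max[l] = highs[last];
    }
    return 0;
}

/* Return the index of the target covering sector, or ix->n if none does. */
static size_t index2_find(const struct index2 *ix, sector_t sector)
{
    size_t l = 0;
    while (l < ix->leaves && ix->node_max[l] < sector)
        l++;                       /* pick the leaf node */
    if (l == ix->leaves)
        return ix->n;
    size_t i = l * KEYS_PER_NODE;
    while (ix->highs[i] < sector)
        i++;                       /* scan inside that leaf only */
    return i;
}
```

Each lookup touches one upper-level node and one leaf instead of the whole array; adding more levels in the same way is what keeps the cost logarithmic for large tables.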

A mapping table holds one or more mapping rules, each represented by a dm_target. The dm_target structure looks like this:

//common/include/linux/device-mapper.h:294
struct dm_target {
  struct dm_table *table; // the table this target belongs to
  struct target_type *type; // target type, discussed below; e.g. linear, striped, mirror, snapshot, error


  /* target limits */
  sector_t begin;
  sector_t len;


  /* If non-zero, maximum size of I/O submitted to a target. */
  uint32_t max_io_len; // maximum I/O length


  /*
   * A number of zero-length barrier bios that will be submitted
   * to the target for the purpose of flushing cache.
   *
   * The bio number can be accessed with dm_bio_get_target_bio_nr.
   * It is a responsibility of the target driver to remap these bios
   * to the real underlying devices.
   */
  unsigned num_flush_bios;


  /*
   * The number of discard bios that will be submitted to the target.
   * The bio number can be accessed with dm_bio_get_target_bio_nr.
   */
  unsigned num_discard_bios;


  /*
   * The number of secure erase bios that will be submitted to the target.
   * The bio number can be accessed with dm_bio_get_target_bio_nr.
   */
  unsigned num_secure_erase_bios;


  /*
   * The number of WRITE SAME bios that will be submitted to the target.
   * The bio number can be accessed with dm_bio_get_target_bio_nr.
   */
  unsigned num_write_same_bios;


  /*
   * The number of WRITE ZEROES bios that will be submitted to the target.
   * The bio number can be accessed with dm_bio_get_target_bio_nr.
   */
  unsigned num_write_zeroes_bios;


  /*
   * The minimum number of extra bytes allocated in each io for the
   * target to use.
   */
  unsigned per_io_data_size;


  /* target specific data */
  void *private;


  /* Used to provide an error string from the ctr */
  char *error;


  /*
   * Set if this target needs to receive flushes regardless of
   * whether or not its underlying devices have support.
   */
  bool flush_supported:1;


  /*
   * Set if this target needs to receive discards regardless of
   * whether or not its underlying devices have support.
   */
  bool discards_supported:1;


  /*
   * Set if we need to limit the number of in-flight bios when swapping.
   */
  bool limit_swap_bios:1;


  /*
   * Set if this target implements a a zoned device and needs emulation of
   * zone append operations using regular writes.
   */
  bool emulate_zone_append:1;
};

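The type field identifies the target's type, which carries the callbacks for targets of that type. The private field points to the target's private configuration structure, whose layout depends on the target type.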
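The interplay of the type and private fields can be sketched with a linear-style target. Everything here is a simplified stand-in: struct target_kind, struct linear_cfg and remap are hypothetical, the field is spelled priv rather than private, and the map callback returns a sector directly instead of remapping a bio as the real callback does.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t sector_t;
struct target;

/* Stand-in for struct target_type: a name plus one callback. */
struct target_kind {
    const char *name;
    sector_t (*map)(const struct target *ti, sector_t sector);
};

/* Stand-in for struct dm_target. */
struct target {
    const struct target_kind *type;
    sector_t begin, len;
    void *priv;                 /* per-type configuration (the private field) */
};

/* Private configuration of a linear-style target (hypothetical layout). */
struct linear_cfg { const char *dev; sector_t start; };

/* Shift the sector by its offset within the target, onto the lower device. */
static sector_t linear_map(const struct target *ti, sector_t sector)
{
    const struct linear_cfg *lc = ti->priv;
    return lc->start + (sector - ti->begin);
}

static const struct target_kind linear_kind = { "linear", linear_map };

/* Core-side dispatch: the caller never needs to know the concrete type. */
static sector_t remap(const struct target *ti, sector_t sector)
{
    assert(sector >= ti->begin && sector - ti->begin < ti->len);
    return ti->type->map(ti, sector);
}
```

Using the second example rule from earlier (start 2056320, length 2875602, device offset 1028169), sector 2056320 of the mapped device lands on sector 1028169 of the lower device.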

target_type:

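A target type is identified by its name and is provided by a module. It implements the constructor, destructor, map, end_io, suspend, resume and status functions (among others) that give the target its semantics.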

//common/include/linux/device-mapper.h:179
struct target_type {
  uint64_t features; // feature flags (DM_TARGET_* bits)
  const char *name; // name of the target type
  struct module *module; // module that implements this target type
  unsigned version[3]; // version number
  dm_ctr_fn ctr; // constructor callback
  dm_dtr_fn dtr; // destructor callback
  dm_map_fn map; // map callback
  dm_clone_and_map_request_fn clone_and_map_rq;
  dm_release_clone_request_fn release_clone_rq;
  dm_endio_fn end_io; // completion callback
  dm_request_endio_fn rq_end_io; // completion callback for request-based mapped devices
  dm_presuspend_fn presuspend; // called before suspend
  dm_presuspend_undo_fn presuspend_undo;
  dm_postsuspend_fn postsuspend; // called after suspend
  dm_preresume_fn preresume; // called before resume
  dm_resume_fn resume; // called on resume
  dm_status_fn status; // status reporting callback
  dm_message_fn message; // passes a message to the target
  dm_prepare_ioctl_fn prepare_ioctl; // called before an ioctl
  dm_report_zones_fn report_zones; // zone reporting; one we may need to look at closely later
  dm_busy_fn busy; // whether I/O can be dispatched to the target; for request-based mapped devices
  dm_iterate_devices_fn iterate_devices; // iterates over all underlying devices
  dm_io_hints_fn io_hints; // computes the target's queue limits
  dm_dax_direct_access_fn direct_access;
  dm_dax_copy_iter_fn dax_copy_from_iter;
  dm_dax_copy_iter_fn dax_copy_to_iter;
  dm_dax_zero_page_range_fn dax_zero_page_range;

  /* For internal device-mapper use. */
  struct list_head list; // link into the list of registered target types
};

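The kernel keeps all registered target types on a linked list whose head is the global variable _targets; each target type descriptor is linked in through its list field. Registering a target type requires not only its name but also a set of callbacks.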
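A name-keyed registry of this kind can be sketched as follows. It is a userspace simplification under stated assumptions: a singly linked list replaces the kernel's list_head, there is no locking or module reference counting, and struct ttype with its ttype_* helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for struct target_type; next plays the role of the list link. */
struct ttype {
    const char *name;
    struct ttype *next;
};

static struct ttype *registered;   /* list head, analogous to _targets */

/* Look a type up by name; NULL if it is not registered. */
static struct ttype *ttype_find(const char *name)
{
    for (struct ttype *t = registered; t; t = t->next)
        if (strcmp(t->name, name) == 0)
            return t;
    return NULL;
}

/* Register a type; refuse duplicates by name. Returns 0 on success, -1 otherwise. */
static int ttype_register(struct ttype *t)
{
    if (ttype_find(t->name))
        return -1;
    t->next = registered;
    registered = t;
    return 0;
}

/* Unlink a type from the list; a no-op if it is not on it. */
static void ttype_unregister(struct ttype *t)
{
    for (struct ttype **pp = &registered; *pp; pp = &(*pp)->next)
        if (*pp == t) {
            *pp = t->next;
            t->next = NULL;
            return;
        }
}
```

Because the name is the lookup key, a table line such as "linear" or "striped" can be resolved to its callbacks with a single list walk.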

Next, the mapped_device structure:

//common/drivers/md/dm-core.h:30
/*
 * DM core internal structures used directly by dm.c, dm-rq.c and dm-table.c.
 * DM targets must _not_ deference a mapped_device or dm_table to directly
 * access their members!
 */


struct mapped_device {
  struct mutex suspend_lock;


  struct mutex table_devices_lock;
  struct list_head table_devices;


  /*
   * The current mapping (struct dm_table *).
   * Use dm_get_live_table{_fast} or take suspend_lock for
   * dereference.
   */
  void __rcu *map;


  unsigned long flags; // flags


  /* Protect queue and type against concurrent access. */
  struct mutex type_lock;
  enum dm_queue_mode type;


  int numa_node_id;
  struct request_queue *queue; // pointer to the request queue

  atomic_t holders; // holder (reference) count
  atomic_t open_count; // number of opens through the block device file


  struct dm_target *immutable_target;
  struct target_type *immutable_target_type;


  char name[16]; // mapped device name in major:minor form
  struct gendisk *disk; // pointer to the generic disk descriptor
  struct dax_device *dax_dev;


  unsigned long __percpu *pending_io; // per-CPU count of in-flight I/Os


  /*
   * A list of ios that arrived while we were suspended.
   */
  struct work_struct work;
  wait_queue_head_t wait;
  spinlock_t deferred_lock;
  struct bio_list deferred;


  void *interface_ptr; // used by the userspace interface


  /*
   * Event handling.
   */
  wait_queue_head_t eventq;
  atomic_t event_nr;
  atomic_t uevent_seq;
  struct list_head uevent_list;
  spinlock_t uevent_lock; /* Protect access to uevent_list */


  /* the number of internal suspends */
  unsigned internal_suspend_count;


  /*
   * io objects are allocated from here.
   */
  struct bio_set io_bs;
  struct bio_set bs;


  /*
   * Processing queue (flush)
   */
  struct workqueue_struct *wq;


  /* forced geometry settings */
  struct hd_geometry geometry;


  /* kobject and completion */
  struct dm_kobject_holder kobj_holder;


  int swap_bios;
  struct semaphore swap_bios_semaphore;
  struct mutex swap_bios_lock;


  struct dm_stats stats;


  /* for blk-mq request-based DM support */
  struct blk_mq_tag_set *tag_set;
  bool init_tio_pdu:1;


  struct srcu_struct io_barrier;


#ifdef CONFIG_BLK_DEV_ZONED
  unsigned int nr_zones;
  unsigned int *zwp_offset;
#endif


#ifdef CONFIG_IMA
  struct dm_ima_measurements ima;
#endif
};

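Mapped devices come in two types: bio-based (requests at the generic block layer) and request-based (requests at the block device driver layer). Bio-based mapped devices are the easier ones to understand: they are ordinary "stacked" devices, and the striped, snapshot-origin and snapshot devices mentioned earlier all fall in this category. Request-based devices need more explanation: they are mapped devices too, but must sit at the "low level" and cannot be stacked; multipath is one example.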

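As a virtual disk device, a mapped device keeps its device-specific data in mapped_device. The disk field points to the generic disk descriptor that holds its general data, and the queue field points to the request queue. The link to the block device descriptor goes through the device number; the material this post follows describes the descriptor as cached in a bdev field, but the struct listed above has no such field. These fields are all set up when the mapped device is created.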

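The type of a mapped device is determined from its mapping table when the device is set up. Bio-based mapped devices are far more widely used than request-based ones.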

Of course, as beginners, we are bound to be thoroughly lost by this point...

dm_dev: the lower-level device structure

//common/include/linux/device-mapper.h:158
struct dm_dev {
  struct block_device *bdev; // block_device descriptor of the underlying device
  struct dax_device *dax_dev;
  fmode_t mode; // access mode
  char name[16]; // device name (major:minor)
};

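A mapping table uses one or more lower-level devices, linked together in a list, and each of them has one such structure. In dm_table, the devices field is the head of this list.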

The private data of each target necessarily holds pointers to the dm_dev descriptors of the lower-level devices named in its mapping rule.

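When a target is added to a table, the constructor of its target type is called. The constructor in turn calls dm_get_device, which searches the devices list of the dm_table for the dm_dev of each device and constructs one if none is found.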

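A lower-level device can be specified in two ways, as major:minor or as a device path. The corresponding function opens the device, obtains its block device descriptor and builds the dm_dev, which is then inserted into the devices list of the dm_table; the target keeps a pointer to it in its private data.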

The destructor of each target must call dm_put_device to release the dm_dev. When the use count of a dm_dev (akin to a reference count) drops to 0, its memory can be freed.
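The get/put pattern described in the last few paragraphs can be sketched like this. It is only an illustration of find-or-create plus use counting: struct dev_entry, get_device and put_device are hypothetical stand-ins, and nothing here opens a real block device as dm_get_device does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for a dm_dev together with its list link and use count. */
struct dev_entry {
    char name[16];            /* "major:minor" */
    int count;                /* number of users within this table */
    struct dev_entry *next;
};

struct table { struct dev_entry *devices; };

/* Look the device up in the table's list; create it on first use. */
static struct dev_entry *get_device(struct table *t, const char *name)
{
    struct dev_entry *d;
    for (d = t->devices; d; d = d->next)
        if (strcmp(d->name, name) == 0) {
            d->count++;
            return d;
        }
    d = calloc(1, sizeof(*d));
    if (!d)
        return NULL;
    strncpy(d->name, name, sizeof(d->name) - 1);
    d->count = 1;
    d->next = t->devices;
    t->devices = d;
    return d;
}

/* Drop one use; unlink and free the entry when the count reaches zero. */
static void put_device(struct table *t, struct dev_entry *d)
{
    if (--d->count > 0)
        return;
    for (struct dev_entry **pp = &t->devices; *pp; pp = &(*pp)->next)
        if (*pp == d) {
            *pp = d->next;
            break;
        }
    free(d);
}
```

Two targets that name the same lower-level device end up sharing one entry, which is why the last put, not the first, is the one that frees it.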

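This post mainly follows Chapter 7 of 《存储技术原理分析》 and records the data structures likely to come up when working with dm. As you can see, they are quite complex, and fully understanding them will probably take a good deal more effort.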