
Linux Block Layer: A Complete Anatomy - v0.1

perftrace@gmail.com

 

  1. Preface

Descriptions of the block layer are scattered in fragments across many sites, and the classic books, not being updated in time, inevitably lag behind the latest code — the block layer's multi-queue support being one example. So it is time for a dedicated write-up on the Linux block layer. This article is based on kernel 4.17.2.

Many of the topics here could each stand alone as a dedicated article, so the amount of information is fairly large. If you cannot read it straight through, read selectively or in stages — digest one section at a time; after all, the article was not written in one sitting either. The reference links at the end are excellent material; if your English allows, skim them without worrying about the details, since most of their content has already been folded into this article.

  2. Overall structure

An operating-system kernel is genuinely complex; if we dove straight into the code, I suspect most readers would give up immediately. So we first lay out the overall framework, starting from the logic — abstract first, concrete later. Let's go.

The block layer is the interface through which file systems access storage devices; it is the bridge between file systems and drivers.

In the code, the block layer can itself be split into two layers: the bio layer and the request layer, as shown in the figure below.

 

    2.1. The bio layer

A file system ultimately calls generic_make_request, which takes the request packaged as a bio structure and passes it into the bio layer. The function itself does not return a status; once the I/O completes, the function specified in bio->bi_end_io is invoked asynchronously. The bio layer essentially is this generic_make_request function — a very thin layer.
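To make this concrete, below is a minimal sketch of submitting one page of I/O through the bio layer (the mydev_* names and error message are hypothetical; the API is as of 4.17):

#include <linux/bio.h>
#include <linux/blkdev.h>

/* completion callback: invoked asynchronously by the bio layer */
static void mydev_end_io(struct bio *bio)
{
        if (bio->bi_status)
                pr_err("mydev: I/O failed\n");
        bio_put(bio);
}

static void mydev_submit_one_page(struct block_device *bdev, struct page *page)
{
        struct bio *bio = bio_alloc(GFP_KERNEL, 1);     /* room for one bio_vec */

        bio_set_dev(bio, bdev);                         /* target device */
        bio->bi_iter.bi_sector = 0;                     /* starting sector */
        bio_add_page(bio, page, PAGE_SIZE, 0);          /* attach the data page */
        bio->bi_end_io = mydev_end_io;                  /* completion callback */
        bio_set_op_attrs(bio, REQ_OP_READ, 0);          /* mark it a read */
        submit_bio(bio);                                /* enters generic_make_request */
}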

Below the bio layer sits the request layer, which, as the figure shows, comes in multi-queue and single-queue flavors.

    2.2. The request layer

The request layer exists in single-queue and multi-queue variants. Multi-queue is the product of the kernel's evolution, and in the future it may well be all that remains. For single-queue, generic_make_request calls blk_queue_bio; for multi-queue it calls blk_mq_make_request.

      2.2.1. Single queue

The single queue mainly targets traditional rotational hardware: the actuator arm can be in only one place at a time, so one queue suffices. Put another way, a single queue is exactly what a spinning disk wants. To exploit the characteristics of mechanical disks, the single-queue layer has three key tasks:

  1. Gather multiple contiguous operations into one request, to make the best use of the hardware. The code checks the queue to see whether an existing request can accept a new bio; if so, the scheduler approves the merge; otherwise the bio may later be merged with other requests. Requests thereby grow large and contiguous.
  2. Sort requests to reduce seek time, without unduly delaying important requests. We cannot know how important each request is, nor how much time a seek will waste, so this is delegated to the queue's I/O scheduler — deadline, cfq, noop, and so on.
  3. Deliver requests to the low-level driver: kick them down when they are ready, and have a mechanism to be notified when they complete. The driver registers a request_fn() via blk_init_queue_node; it is called when a new request appears on the queue. The driver then calls blk_peek_request to pick up requests and process them; once a request finishes, the driver keeps fetching the next request rather than waiting for another request_fn() call. Each completed request ends with blk_finish_request(). (A driver skeleton is sketched after this list.)
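A rough sketch of task 3 — the legacy single-queue driver pattern — might look like this (the mydev names are hypothetical; legacy API as of 4.17):

#include <linux/blkdev.h>

static void mydev_request_fn(struct request_queue *q)
{
        struct request *req;

        /* pull requests off the queue until it is drained */
        while ((req = blk_fetch_request(q)) != NULL) {
                /* ... transfer blk_rq_pos(req) / blk_rq_bytes(req) ... */
                __blk_end_request_all(req, BLK_STS_OK);
        }
}

/* at init time: register request_fn together with a queue lock */
dev->queue = blk_init_queue(mydev_request_fn, &dev->lock);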

Some devices can accept several requests at once, taking new requests before earlier ones complete. Requests are tagged, so that when one completes the right one can be finished. As devices advanced and took on more of the scheduling work internally, the appeal of multi-queue kept growing.

      2.2.2. Multi-queue

Another motivation for multi-queue is that as systems grew more cores, funnelling everything into one queue became a direct performance bottleneck.

If a queue is allocated per NUMA node or per CPU, the pressure of feeding requests into the queue drops dramatically. But if the hardware accepts one submission at a time, multi-queue must merge things back together at the end.

We know the cfq scheduler also keeps multiple queues internally, but unlike cfq — which associates its queues with priorities — multi-queue binds queues to the hardware.

The multi-queue request layer has two kinds of hardware-related queues: software staging queues (also called submission queues) and hardware dispatch queues.

A software staging queue is represented by struct blk_mq_ctx and allocated per CPU or per NUMA node; requests are first added to these queues. They are managed by a dedicated multi-queue scheduler, commonly bfq, kyber or mq-deadline. Software queues on different CPUs are not merged across CPUs.

Hardware dispatch queues are allocated based on the target hardware: there may be just one, or as many as 2048. The driver is responsible for feeding the underlying device. The request layer allocates one struct blk_mq_hw_ctx (hardware context) per hardware queue, and a request is eventually handed to the low-level driver together with its hardware context. This queue throttles submissions to the device driver to prevent overload. Ideally a request is dispatched from software queue to hardware queue on the same CPU that submitted it, to improve CPU cache locality.

Another difference from single-queue is that multi-queue preallocates its request structures. Each request structure carries an integer tag, which identifies it in exchanges with the device.

Instead of providing a request_fn(), a multi-queue driver supplies an operations structure, blk_mq_ops, which defines a set of functions; the most important is queue_rq(), alongside others for timeouts, polling, request initialization and so on. When the scheduler decides a request is ready and should no longer sit on a queue, it calls queue_rq(), pushing the request down and out of the request layer — a reversal of the single-queue model, where the driver pulls requests from the queue. queue_rq() may place the request on an internal FIFO or process it directly. It can also refuse the request by returning BLK_STS_RESOURCE, which leaves it on the staging queue. Apart from BLK_STS_RESOURCE and BLK_STS_OK, every other return value indicates an error.
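As a sketch, the skeleton of a multi-queue driver built around queue_rq() could look like this (hypothetical mydev names; multi-queue API as of 4.17):

#include <linux/blk-mq.h>

static blk_status_t mydev_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        blk_mq_start_request(rq);
        /* ... issue the request to the hardware ... */
        blk_mq_end_request(rq, BLK_STS_OK);     /* or complete it later */
        return BLK_STS_OK;                      /* BLK_STS_RESOURCE = leave it staged */
}

static const struct blk_mq_ops mydev_mq_ops = {
        .queue_rq = mydev_queue_rq,
};

/* at init time: describe the queues via a tag set, then create the queue */
dev->tag_set.ops          = &mydev_mq_ops;
dev->tag_set.nr_hw_queues = 1;
dev->tag_set.queue_depth  = 64;
dev->tag_set.numa_node    = NUMA_NO_NODE;
dev->tag_set.flags        = BLK_MQ_F_SHOULD_MERGE;
if (blk_mq_alloc_tag_set(&dev->tag_set) == 0)
        dev->queue = blk_mq_init_queue(&dev->tag_set);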

        2.2.2.1. Multi-queue scheduling

A multi-queue driver does not need a scheduler configured; it then works much like the single-queue noop scheduler, handing requests to the driver whenever blk_mq_run_hw_queue() or blk_mq_delay_run_hw_queue() is called. The multi-queue scheduler entry points are defined in the elevator_mq_ops function set; the main ones are insert_requests() and dispatch_request(). insert_requests() inserts requests into the staging queues, and dispatch_request() picks a request to feed to a given hardware queue. The kernel is flexible here: insert_requests() may be omitted, in which case requests are simply appended at the tail; if dispatch_request() is omitted, requests are pulled from whatever staging queue and thrown at the hardware queue, which hurts performance (though if the device has only one hardware queue it hardly matters).
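To illustrate how a multi-queue scheduler plugs in, here is a hedged skeleton (hypothetical mysched names, loosely modeled on mq-deadline; not a working scheduler):

#include <linux/elevator.h>

static void mysched_insert_requests(struct blk_mq_hw_ctx *hctx,
                                    struct list_head *list, bool at_head)
{
        /* take requests off @list into the scheduler's own structures */
}

static struct request *mysched_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
        /* hand back the next request for this hardware queue, or NULL */
        return NULL;
}

static struct elevator_type mysched = {
        .ops.mq = {
                .insert_requests  = mysched_insert_requests,
                .dispatch_request = mysched_dispatch_request,
        },
        .uses_mq        = true,
        .elevator_name  = "mysched",
        .elevator_owner = THIS_MODULE,
};

/* module init would call elv_register(&mysched) */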

As mentioned above, the three multi-queue schedulers in common use are mq-deadline, bfq and kyber.

mq-deadline provides an insert_request function that ignores the staging queues and inserts requests directly into two global, time-ordered queues, one for reads and one for writes. Its dispatch_request returns a request based on age, size and starvation. Note the name differs from the one in elevator_mq_ops — it is missing an 's'.

The bfq scheduler — short for Budget Fair Queueing — is an evolution of cfq. In this respect, though, it is more like mq-deadline: it does not use the per-CPU staging queues, and with multiple queues it serializes behind a single spinlock contended by all CPUs.

The Kyber I/O scheduler does use the per-CPU (or per-node) staging queues. It provides no insert_request function, relying on the default behavior, and its dispatch_request maintains internal queues per hardware context. Discussion of this scheduler only began in early 2017, so many details may still settle; we skip over it for now.

        2.2.2.2. Multi-queue topology

Finally, a picture that makes it as clear as can be:

In the figure there are more software staging queues than hardware dispatch queues.

In fact there are three possible cases:

  1. More software staging queues than hardware dispatch queues

Two or more software staging queues are assigned to one hardware queue; when dispatching, requests are pulled from all the associated software queues.

  2. Fewer software staging queues than hardware dispatch queues

Here the software queues are mapped onto the hardware queues sequentially.

  3. Equal numbers of software staging queues and hardware dispatch queues

This is a plain 1:1 mapping. (A sketch of the default mapping follows this list.)
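For illustration, the idea behind the default mapping can be sketched as a simple spread of CPUs across the hardware queues (a simplification of what blk_mq_map_queues does in 4.17; sketch_map_queues is our own name):

static void sketch_map_queues(unsigned int *mq_map, unsigned int nr_cpus,
                              unsigned int nr_hw_queues)
{
        unsigned int cpu;

        /* covers all three cases: many-to-1, sequential, and 1:1 */
        for (cpu = 0; cpu < nr_cpus; cpu++)
                mq_map[cpu] = cpu % nr_hw_queues;
}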

 

      2.2.3. When will multi-queue replace single-queue

This will likely take a while longer, since nothing new arrives perfect. Red Hat found performance problems with mq-deadline in internal storage testing, and other companies have reported regressions in their tests too. Still, it is only a matter of time, and not a long one.

Annoyingly, most books describe only the single-queue world, schedulers included. Fortunately this write-up covers the rest — feel free to share it with your friends.

    2.3. bio-based drivers

Early on, getting around the kernel's single-queue bottleneck meant bypassing the request layer. A driver that goes through the request layer is called request-based; a driver that skips it is a bio-based driver. How is the layer skipped?

A device driver can register a make_request_fn by calling blk_queue_make_request; the make_request_fn then handles bios directly. generic_make_request invokes the device's make_request_fn with each bio, bypassing the request layer beneath the bio layer — some devices, SSDs for example, have no need for the merging and sorting the request layer performs.
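A sketch of the bio-based pattern (hypothetical mydev names; API as of 4.17):

static blk_qc_t mydev_make_request(struct request_queue *q, struct bio *bio)
{
        /* service the bio directly against the device or backing store */
        bio_endio(bio);         /* complete it: no request is ever built */
        return BLK_QC_T_NONE;
}

/* at init time: allocate a bare queue and install the make_request_fn */
dev->queue = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE, NULL);
blk_queue_make_request(dev->queue, mydev_make_request);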

In fact this approach was not designed for SSD performance at all; it was created for MD RAID, which processes requests and then resubmits them to the real hardware devices underneath.

Moreover, while bio-based drivers were a way around the kernel's single-queue bottleneck, they brought a problem of their own: every driver had to handle — and reinvent — everything itself, with no shared code. The multi-queue machinery was introduced to address exactly this, and bio-based drivers are expected to slowly die out. See: Blk-mq: new multi-queue block IO queueing mechanism.

Overall, the bio layer is thin: it only builds I/O requests into bio structures and passes them to the appropriate make_request_fn(). The request layer is much thicker — it is where the schedulers, request merging and the rest live.

  3. Request dispatch logic
    3.1. Multi-queue
      3.1.1. Request submission

Multi-queue's make_request function is blk_mq_make_request. Plugging is used only when the device exposes a single hardware queue or the request is asynchronous; for synchronous requests no plugging is performed.

The make_request function performs request merging: when the device allows plugging, it searches the plug list for a suitable merge candidate, and finally maps to the software queue of the current CPU. The submission path involves no I/O-scheduler-specific callbacks.

make_request sends synchronous requests to the corresponding hardware queue immediately; asynchronous or flush (batched) requests are delayed so that later merging can make dispatch more efficient.

So make_request handles synchronous and asynchronous requests somewhat differently.

      3.1.2. Request dispatch

If the I/O request is synchronous (so plugging is not allowed in multi-queue), dispatch is executed in the context of that same request.

If it is asynchronous or a flush, dispatch may run in the context of another request associated with the same hardware queue, or be executed by delayed scheduled work.

In multi-queue this is implemented by blk_mq_run_hw_queue: synchronous requests are dispatched to the driver immediately, asynchronous ones are delayed. For synchronous requests the function calls the internal __blk_mq_run_hw_queue, which first joins the software queues associated with the current hardware queue to any entries already on the dispatch list; having collected the entries, it dispatches each request to the driver, ultimately through queue_rq.

The whole logic lives in blk_mq_make_request.

The multi-queue flow is shown in the figure below:

A high-resolution blk_mq diagram is available at:

https://github.com/kernel-z/filesystem/blob/master/blk_mq.png

 

    3.2. Single queue

In the single-queue case generic_make_request calls blk_queue_bio to process the bio. This is the most important function in the legacy block layer and deserves close attention; it is packed with logic, and on a first reading you will almost certainly get lost in it.

blk_queue_bio performs the elevator-scheduling work — front merges and back merges; if the bio cannot be merged, a new request is produced. Finally blk_account_io_start is called to record that I/O processing has begun; much of the I/O monitoring and accounting starts at this function.

The logic is much simpler than the code. First check whether the bio can be merged into the process's plug list; failing that, whether it can be merged into the block layer's request queue. If neither merge is possible, create a new request to carry the bio. From there it depends on whether plugging applies: if plugging is allowed, check whether the existing plug list needs flushing; if not, simply hang the request on the plug list and return. If plugging is not in effect, add the request to the request queue and call __blk_run_queue, which invokes rq->request_fn (set by the device driver — scsi_request_fn for SCSI), leaving the block layer. That is the overall shape of blk_queue_bio, shown in the figure below:

High-resolution diagram:

https://github.com/kernel-z/filesystem/blob/master/blk_single.png

Requests on the plug list are flushed not only from the blk_queue_bio call chain (blk_flush_plug_list) but also by the process scheduler:

schedule->

sched_submit_work ->

blk_schedule_flush_plug()->

blk_flush_plug_list(plug, true) ->

queue_unplugged->

      blk_run_queue_async

which wakes the kblockd workqueue to perform the unplug.

Requests on the plug list are first flushed into the request queue; in the end everything goes down through __blk_run_queue, which calls ->request_fn — a function that varies per driver (scsi_request_fn for the SCSI driver).

      3.2.1. Function summary

Insertion: __elv_add_request

Splitting: blk_queue_split

Merging:

bio_attempt_front_merge / bio_attempt_back_merge, blk_attempt_plug_merge

Issuing I/O: __blk_run_queue

  4. Block-layer initialization analysis (SCSI)

At initialization a driver determines, from the hardware, whether it can use multi-queue. That choice fixes both the entry function through which requests enter the block layer (blk_queue_bio or blk_mq_make_request) and the function through which they finally leave it (scsi_request_fn or scsi_queue_rq).

    4.1. The SCSI example
      4.1.1. scsi_alloc_sdev

While probing for SCSI devices, the driver calls scsi_alloc_sdev, which allocates and initializes the I/O state and returns a pointer to a scsi_device. The scsi_device stores host, channel, id and lun, and is added to the appropriate lists.

The function makes the following check:

if (shost_use_blk_mq(shost))
        sdev->request_queue = scsi_mq_alloc_queue(sdev);
else
        sdev->request_queue = scsi_old_alloc_queue(sdev);

If the device can use multi-queue, scsi_mq_alloc_queue is called; otherwise the single-queue scsi_old_alloc_queue is used. The sdev argument is the scsi_device.

scsi_mq_alloc_queue calls blk_mq_init_queue, which ends up registering blk_mq_make_request.

The initialization flow is shown below (rotated vertically, since it is too wide horizontally):

High-resolution diagram:

https://github.com/kernel-z/filesystem/blob/master/scsi-init.png

What follows are explanations of the concrete structures and functions in the code; combined with the description above, they should make the block layer easier to understand.

  5. Structures: the key structures
    5.1. request

The request structure represents a request to operate on a block device. It is placed on a request_queue and processed when the time is right.

It is defined in include/linux/blkdev.h:

struct request {        

        struct request_queue *q;  /* queue this request lives on */

        struct blk_mq_ctx *mq_ctx;

 

        int cpu;        

        unsigned int cmd_flags;         /* op and common flags */

        req_flags_t rq_flags;    

 

        int internal_tag;

 

        /* the following two fields are internal, NEVER access directly */

        unsigned int __data_len;        /* total data len */     

        int tag;        

        sector_t __sector;              /* sector cursor */      

 

        struct bio *bio;

        struct bio *biotail;     

 

        struct list_head queuelist; /* link in the request queue */

 

        /*

         * The hash is used inside the scheduler, and killed once the

         * request reaches the dispatch list. The ipi_list is only used

         * to queue the request for softirq completion, which is long

         * after the request has been unhashed (and even removed from

         * the dispatch list).

         */

        union {

                struct hlist_node hash; /* merge hash */

                struct list_head ipi_list;

        };

 

        /*

         * The rb_node is only used inside the io scheduler, requests

         * are pruned when moved to the dispatch queue. So let the

         * completion_data share space with the rb_node.

         */

        union {

                struct rb_node rb_node; /* sort/lookup */

                struct bio_vec special_vec;

                void *completion_data;

                int error_count; /* for legacy drivers, don't use */

        };

        /*

         * Three pointers are available for the IO schedulers, if they need

         * more they have to dynamically allocate it.  Flush requests are

         * never put on the IO scheduler. So let the flush fields share

         * space with the elevator data.

         */

        union {

                struct {

                        struct io_cq            *icq;

                        void                    *priv[2];

                } elv;

 

                struct {

                        unsigned int            seq;

                        struct list_head        list;

                        rq_end_io_fn            *saved_end_io;

                } flush;

        };

 

        struct gendisk *rq_disk;

        struct hd_struct *part;

        unsigned long start_time;

        struct blk_issue_stat issue_stat;

        /* Number of scatter-gather DMA addr+len pairs after

         * physical address coalescing is performed.

         */

        unsigned short nr_phys_segments;

 

#if defined(CONFIG_BLK_DEV_INTEGRITY)

        unsigned short nr_integrity_segments;

#endif

 

        unsigned short write_hint;

        unsigned short ioprio;

 

        unsigned int timeout;

 

        void *special;          /* opaque pointer available for LLD use */

 

        unsigned int extra_len; /* length of alignment and padding */

 

        /*

         * On blk-mq, the lower bits of ->gstate (generation number and

         * state) carry the MQ_RQ_* state value and the upper bits the

         * generation number which is monotonically incremented and used to

         * distinguish the reuse instances.

         *

         * ->gstate_seq allows updates to ->gstate and other fields

         * (currently ->deadline) during request start to be read

         * atomically from the timeout path, so that it can operate on a

         * coherent set of information.

         */

        seqcount_t gstate_seq;

        u64 gstate;

 

        /*

         * ->aborted_gstate is used by the timeout to claim a specific

         * recycle instance of this request.  See blk_mq_timeout_work().

         */

        struct u64_stats_sync aborted_gstate_sync;

        u64 aborted_gstate;

 

        /* access through blk_rq_set_deadline, blk_rq_deadline */

        unsigned long __deadline;

 

        struct list_head timeout_list;

 

        union {

                struct __call_single_data csd;

                u64 fifo_time;

        };

 

        /*

         * completion callback.

         */

        rq_end_io_fn *end_io;

        void *end_io_data;

 

        /* for bidi */

        struct request *next_rq;

 

#ifdef CONFIG_BLK_CGROUP

        struct request_list *rl;                /* rl this rq is alloced from */

        unsigned long long start_time_ns;

        unsigned long long io_start_time_ns;    /* when passed to hardware */

#endif

};

 

It represents a block-driver-level I/O request: after transformation by the I/O scheduling layer, it is sent down to the block device driver for processing.

 

    5.2. request_queue

Every block device has a queue; when the device needs to be operated on, requests are placed on its queue. Block-device I/O cannot complete as soon as it is issued — I/O is slow — so all requests are queued and handled when the time is right.

Defined in include/linux/blkdev.h:

struct request_queue {

        /*

         * Together with queue_head for cacheline sharing

         */

        struct list_head        queue_head;     /* list of pending requests */

        struct request          *last_merge;    /* last merge candidate on the queue */

        struct elevator_queue   *elevator;      /* the elevator (I/O scheduler) instance */

        int                     nr_rqs[2];      /* # allocated [a]sync rqs */

        int                     nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */

 

        atomic_t                shared_hctx_restart;

 

        struct blk_queue_stats  *stats;

        struct rq_wb            *rq_wb;

 

        /*

         * If blkcg is not used, @q->root_rl serves all requests.  If blkcg

         * is used, root blkg allocates from @q->root_rl and all other

         * blkgs from their own blkg->rl.  Which one to use should be

         * determined using bio_request_list().

         */

        struct request_list     root_rl;

 

        request_fn_proc         *request_fn;    /* driver's strategy-routine entry point */

        make_request_fn         *make_request_fn;

        poll_q_fn               *poll_fn;

        prep_rq_fn              *prep_rq_fn;

        unprep_rq_fn            *unprep_rq_fn;

        softirq_done_fn         *softirq_done_fn;

        rq_timed_out_fn         *rq_timed_out_fn;

        dma_drain_needed_fn     *dma_drain_needed;

        lld_busy_fn             *lld_busy_fn;

        /* Called just after a request is allocated */

        init_rq_fn              *init_rq_fn;

        /* Called just before a request is freed */

        exit_rq_fn              *exit_rq_fn;

        /* Called from inside blk_get_request() */

        void (*initialize_rq_fn)(struct request *rq);

 

        const struct blk_mq_ops *mq_ops;

 

        unsigned int            *mq_map;

 

        /* sw queues */

        struct blk_mq_ctx __percpu      *queue_ctx;

        unsigned int            nr_queues;

        unsigned int            queue_depth;

 

        /* hw dispatch queues */

        struct blk_mq_hw_ctx    **queue_hw_ctx;

        unsigned int            nr_hw_queues;

 

        /*

         * Dispatch queue sorting

         */

        sector_t                end_sector;

        struct request          *boundary_rq;

 

        /*

         * Delayed queue handling

         */

        struct delayed_work     delay_work;

 

        struct backing_dev_info *backing_dev_info;

 

        /*

         * The queue owner gets to use this for whatever they like.

         * ll_rw_blk doesn't touch it.

         */

        void                    *queuedata;

 

        /*

         * various queue flags, see QUEUE_* below

         */

        unsigned long           queue_flags;

 

        /*

         * ida allocated id for this queue.  Used to index queues from

         * ioctx.

         */

        int                     id;

 

        /*

         * queue needs bounce pages for pages above this limit

         */

        gfp_t                   bounce_gfp;

 

        /*

         * protects queue structures from reentrancy. ->__queue_lock should

         * _never_ be used directly, it is queue private. always use

         * ->queue_lock.

         */

        spinlock_t              __queue_lock;

        spinlock_t              *queue_lock;

 

        /*

         * queue kobject

         */

        struct kobject kobj;

 

        /*

         * mq queue kobject

         */

        struct kobject mq_kobj;

 

#ifdef  CONFIG_BLK_DEV_INTEGRITY

        struct blk_integrity integrity;

#endif  /* CONFIG_BLK_DEV_INTEGRITY */

 

#ifdef CONFIG_PM

        struct device           *dev;

        int                     rpm_status;

        unsigned int            nr_pending;

#endif

 

        /*

         * queue settings

         */

        unsigned long           nr_requests;    /* Max # of requests */

        unsigned int            nr_congestion_on;

        unsigned int            nr_congestion_off;

        unsigned int            nr_batching;

 

        unsigned int            dma_drain_size;

        void                    *dma_drain_buffer;

        unsigned int            dma_pad_mask;

        unsigned int            dma_alignment;

 

        struct blk_queue_tag    *queue_tags;

        struct list_head        tag_busy_list;

 

        unsigned int            nr_sorted;

        unsigned int            in_flight[2];

 

        /*

         * Number of active block driver functions for which blk_drain_queue()

         * must wait. Must be incremented around functions that unlock the

         * queue_lock internally, e.g. scsi_request_fn().

         */

        unsigned int            request_fn_active;

 

        unsigned int            rq_timeout;

        int                     poll_nsec;

 

        struct blk_stat_callback        *poll_cb;

        struct blk_rq_stat      poll_stat[BLK_MQ_POLL_STATS_BKTS];

 

        struct timer_list       timeout;

        struct work_struct      timeout_work;

        struct list_head        timeout_list;

 

        struct list_head        icq_list;

#ifdef CONFIG_BLK_CGROUP

        DECLARE_BITMAP          (blkcg_pols, BLKCG_MAX_POLS);

        struct blkcg_gq         *root_blkg;

        struct list_head        blkg_list;

#endif

 

        struct queue_limits     limits;

 

        /*

         * Zoned block device information for request dispatch control.

         * nr_zones is the total number of zones of the device. This is always

         * 0 for regular block devices. seq_zones_bitmap is a bitmap of nr_zones

         * bits which indicates if a zone is conventional (bit clear) or

         * sequential (bit set). seq_zones_wlock is a bitmap of nr_zones

         * bits which indicates if a zone is write locked, that is, if a write

         * request targeting the zone was dispatched. All three fields are

         * initialized by the low level device driver (e.g. scsi/sd.c).

         * Stacking drivers (device mappers) may or may not initialize

         * these fields.

         */

        unsigned int            nr_zones;

        unsigned long           *seq_zones_bitmap;

        unsigned long           *seq_zones_wlock;

 

        /*

         * sg stuff

         */

        unsigned int            sg_timeout;

        unsigned int            sg_reserved_size;

        int                     node;

#ifdef CONFIG_BLK_DEV_IO_TRACE

        struct blk_trace        *blk_trace;

        struct mutex            blk_trace_mutex;

#endif

        /*

         * for flush operations

         */

        struct blk_flush_queue  *fq;

 

        struct list_head        requeue_list;

        spinlock_t              requeue_lock;

        struct delayed_work     requeue_work;

 

        struct mutex            sysfs_lock;

 

        int                     bypass_depth;

        atomic_t                mq_freeze_depth;

 

#if defined(CONFIG_BLK_DEV_BSG)

        bsg_job_fn              *bsg_job_fn;

        struct bsg_class_device bsg_dev;

#endif

 

#ifdef CONFIG_BLK_DEV_THROTTLING

        /* Throttle data */

        struct throtl_data *td;

#endif

        struct rcu_head         rcu_head;

        wait_queue_head_t       mq_freeze_wq;

        struct percpu_ref       q_usage_counter;

        struct list_head        all_q_node;

 

        struct blk_mq_tag_set   *tag_set;

        struct list_head        tag_set_list;

        struct bio_set          *bio_split;

 

#ifdef CONFIG_BLK_DEBUG_FS

        struct dentry           *debugfs_dir;

        struct dentry           *sched_debugfs_dir;

#endif

 

        bool                    mq_sysfs_init_done;

 

        size_t                  cmd_size;

        void                    *rq_alloc_data;

 

        struct work_struct      release_work;

#define BLK_MAX_WRITE_HINTS     5

        u64                     write_hints[BLK_MAX_WRITE_HINTS];

};

This structure is huge — nearly the size of sk_buff.

It maintains the queue of block-driver-level I/O requests: all requests are inserted into this queue, and each disk device has exactly one queue (even with multiple partitions).

A request_queue contains many requests, and each request may contain several bios; request merging is the business of adding multiple bios into one request according to various rules.
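As a sketch of that layering, this is how a driver typically walks the data of one request, bio by bio and segment by segment (hypothetical mydev name; API as of 4.17):

static void mydev_xfer_request(struct request *rq)
{
        struct req_iterator iter;
        struct bio_vec bvec;

        rq_for_each_segment(bvec, rq, iter) {
                /* one contiguous segment: (page, offset, len) */
                void *buf = page_address(bvec.bv_page) + bvec.bv_offset;
                /* ... transfer bvec.bv_len bytes at buf ... */
        }
}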

    5.3. elevator_queue

The elevator scheduling queue; every request queue has one.

struct elevator_queue

{

        struct elevator_type *type;

        void *elevator_data;

        struct kobject kobj;

        struct mutex sysfs_lock;

        unsigned int registered:1;

        unsigned int uses_mq:1;

        DECLARE_HASHTABLE(hash, ELV_HASH_BITS);

};

    5.4. elevator_type

The elevator type is really the scheduling-algorithm type (deadline, cfq, and so on).

struct elevator_type

{              

        /* managed by elevator core */

        struct kmem_cache *icq_cache;

                

        /* fields provided by elevator implementation */

        union {

                struct elevator_ops sq;

                struct elevator_mq_ops mq;

        } ops;

        size_t icq_size;        /* see iocontext.h */

        size_t icq_align;       /* ditto */

        struct elv_fs_entry *elevator_attrs;

        char elevator_name[ELV_NAME_MAX];

        const char *elevator_alias;

        struct module *elevator_owner;

        bool uses_mq;

#ifdef CONFIG_BLK_DEBUG_FS

        const struct blk_mq_debugfs_attr *queue_debugfs_attrs;

        const struct blk_mq_debugfs_attr *hctx_debugfs_attrs;

#endif 

 

        /* managed by elevator core */

        char icq_cache_name[ELV_NAME_MAX + 6];  /* elvname + "_io_cq" */

        struct list_head list;

};

      5.4.1. iosched_cfq

For example, the cfq scheduler's elevator_type specifies every function the scheduler provides.

static struct elevator_type iosched_cfq = {

        .ops.sq = {

                .elevator_merge_fn =            cfq_merge,

                .elevator_merged_fn =           cfq_merged_request,

                .elevator_merge_req_fn =        cfq_merged_requests,

                .elevator_allow_bio_merge_fn =  cfq_allow_bio_merge,

                .elevator_allow_rq_merge_fn =   cfq_allow_rq_merge,

                .elevator_bio_merged_fn =       cfq_bio_merged,

                .elevator_dispatch_fn =         cfq_dispatch_requests,

                .elevator_add_req_fn =          cfq_insert_request,

                .elevator_activate_req_fn =     cfq_activate_request,

                .elevator_deactivate_req_fn =   cfq_deactivate_request,

                .elevator_completed_req_fn =    cfq_completed_request,

                .elevator_former_req_fn =       elv_rb_former_request,

                .elevator_latter_req_fn =       elv_rb_latter_request,

                .elevator_init_icq_fn =         cfq_init_icq,

                .elevator_exit_icq_fn =         cfq_exit_icq,

                .elevator_set_req_fn =          cfq_set_request,

                .elevator_put_req_fn =          cfq_put_request,

                .elevator_may_queue_fn =        cfq_may_queue,

                .elevator_init_fn =             cfq_init_queue,

                .elevator_exit_fn =             cfq_exit_queue,

                .elevator_registered_fn =       cfq_registered_queue,

        },

        .icq_size       =       sizeof(struct cfq_io_cq),

        .icq_align      =       __alignof__(struct cfq_io_cq),

        .elevator_attrs =       cfq_attrs,

        .elevator_name  =       "cfq",

        .elevator_owner =       THIS_MODULE,

};

    5.5. gendisk

Now the disk structure, gendisk (defined in include/linux/genhd.h): the kernel's representation of a single disk drive, and the most important data structure in the block I/O subsystem.

struct gendisk {

        /* major, first_minor and minors are input parameters only,

         * don't use directly.  Use disk_devt() and disk_max_parts().

         */

        int major;                      /* major number of driver */

        int first_minor;

        int minors;                     /* maximum number of minors, =1 for

                                         * disks that can't be partitioned. */

 

        char disk_name[DISK_NAME_LEN];  /* name of major driver */

        char *(*devnode)(struct gendisk *gd, umode_t *mode);

 

        unsigned int events;            /* supported events */

        unsigned int async_events;      /* async events, subset of all */

 

        /* Array of pointers to partitions indexed by partno.

         * Protected with matching bdev lock but stat and other

         * non-critical accesses use RCU.  Always access through

         * helpers.

         */

        struct disk_part_tbl __rcu *part_tbl;

        struct hd_struct part0;

 

        const struct block_device_operations *fops;

        struct request_queue *queue;

        void *private_data;

 

        int flags;

        struct rw_semaphore lookup_sem;

        struct kobject *slave_dir;

 

        struct timer_rand_state *random;

        atomic_t sync_io;               /* RAID */

        struct disk_events *ev;

#ifdef  CONFIG_BLK_DEV_INTEGRITY

        struct kobject integrity_kobj;

#endif  /* CONFIG_BLK_DEV_INTEGRITY */

        int node_id;

        struct badblocks *bb;

        struct lockdep_map lockdep_map;

};

The structure holds the device's major number, minor numbers (marking the partitions), the driver name (which appears in /proc/partitions and sysfs), the device's operations (block_device_operations), the device's request queue, drive state, capacity, and a private_data pointer for the driver's internal data.

Related functions: alloc_disk allocates a disk, and del_gendisk takes it down, dropping a reference to the structure.

Allocating a gendisk does not make the disk available to the system; you must also initialize the structure and call add_disk. Once add_disk has been called the disk is "live" and its methods may be invoked at any time — the kernel may start poking the device immediately. In fact the first calls will likely happen before add_disk even returns: the kernel will read the first few blocks looking for a partition table. So do not call add_disk until the driver is fully initialized and ready to respond to requests on the disk.
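A sketch of the bring-up order (hypothetical mydev names; API as of 4.17) — note that everything is in place before add_disk():

struct gendisk *disk = alloc_disk(16);          /* up to 16 minors */

disk->major        = mydev_major;
disk->first_minor  = 0;
disk->fops         = &mydev_fops;               /* block_device_operations */
disk->queue        = dev->queue;                /* request queue built earlier */
disk->private_data = dev;
snprintf(disk->disk_name, DISK_NAME_LEN, "mydev0");
set_capacity(disk, nr_sectors);                 /* capacity in 512-byte sectors */
add_disk(disk);                                 /* the disk goes live here */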

    5.6. hd_struct

The disk partition structure.

struct hd_struct {

        sector_t start_sect;

        /*

         * nr_sects is protected by sequence counter. One might extend a

         * partition while IO is happening to it and update of nr_sects

         * can be non-atomic on 32bit machines with 64bit sector_t.

         */

        sector_t nr_sects;

        seqcount_t nr_sects_seq;

        sector_t alignment_offset;

        unsigned int discard_alignment;

        struct device __dev;

        struct kobject *holder_dir;

        int policy, partno;

        struct partition_meta_info *info;

#ifdef CONFIG_FAIL_MAKE_REQUEST

        int make_it_fail;

#endif

        unsigned long stamp;

        atomic_t in_flight[2];

#ifdef  CONFIG_SMP

        struct disk_stats __percpu *dkstats;

#else

        struct disk_stats dkstats;

#endif

        struct percpu_ref ref;

        struct rcu_head rcu_head;

};

    5.7. bio

Before the 2.4 kernels, buffer heads were used, under which every I/O request was broken into 512-byte pieces — no way to build a high-performance I/O subsystem. A major piece of work in 2.5 was high-performance I/O support, which gave us today's bio structure.

The bio carries a request's actual data: a request contains one or more bios, and at the bottom the device is really operated on bio by bio as they are handed to the driver.

The code merges a bio into an existing request, or creates a new request when needed; the bio contains everything the driver needs to execute the request.

A bio holds multiple pages, which correspond to a contiguous area of the disk. Since files are not stored contiguously on disk, a file I/O is very likely split into several bios before it reaches the block device.

Defined in include/linux/blk_types.h. Unfortunately the structure has changed considerably over time — in particular it no longer matches the one described in the LDD book.

/*

 * main unit of I/O for the block layer and lower layers (ie drivers and

 * stacking drivers)

 */

struct bio {

        struct bio              *bi_next;       /* request queue link */

        struct gendisk          *bi_disk;

        unsigned int            bi_opf;         /* bottom bits req flags,

                                                 * top bits REQ_OP. Use

                                                 * accessors.

                                                 */

        unsigned short          bi_flags;       /* status, etc and bvec pool number */

        unsigned short          bi_ioprio;

        unsigned short          bi_write_hint;

        blk_status_t            bi_status;

        u8                      bi_partno;

 

        /* Number of segments in this BIO after

         * physical address coalescing is performed.

         */

        unsigned int            bi_phys_segments;

 

        /*

         * To keep track of the max segment size, we account for the

         * sizes of the first and last mergeable segments in this bio.

         */

        unsigned int            bi_seg_front_size;

        unsigned int            bi_seg_back_size;

 

        struct bvec_iter        bi_iter;

 

        atomic_t                __bi_remaining;

        bio_end_io_t            *bi_end_io;

 

        void                    *bi_private;

#ifdef CONFIG_BLK_CGROUP

        /*

         * Optional ioc and css associated with this bio.  Put on bio

         * release.  Read comment on top of bio_associate_current().

         */

        struct io_context       *bi_ioc;

        struct cgroup_subsys_state *bi_css;

#ifdef CONFIG_BLK_DEV_THROTTLING_LOW

        void                    *bi_cg_private;

        struct blk_issue_stat   bi_issue_stat;

#endif

#endif

        union {

#if defined(CONFIG_BLK_DEV_INTEGRITY)

                struct bio_integrity_payload *bi_integrity; /* data integrity */

#endif

        };

 

        unsigned short          bi_vcnt;        /* how many bio_vec's */

 

        /*

         * Everything starting with bi_max_vecs will be preserved by bio_reset()

         */

 

        unsigned short          bi_max_vecs;    /* max bvl_vecs we can hold */

 

        atomic_t                __bi_cnt;       /* pin count */

 

        struct bio_vec          *bi_io_vec;     /* the actual vec list */

 

        struct bio_set          *bi_pool;

 

        /*

         * We can inline a number of vecs at the end of the bio, to avoid

         * double allocations for a small number of bio_vecs. This member

         * MUST obviously be kept at the very end of the bio.

         */

        struct bio_vec          bi_inline_vecs[0];

};

    5.8. bio_vec

The bio_vec structure lives in include/linux/bvec.h:

struct bio_vec {

        struct page     *bv_page; /* physical page holding the buffer */

        unsigned int    bv_len;   /* length of the segment in bytes */

        unsigned int    bv_offset;/* offset within the page, in bytes */

};
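A common pattern built on this structure is iterating a bio segment by segment (API as of 4.17):

struct bio_vec bvec;
struct bvec_iter iter;

bio_for_each_segment(bvec, bio, iter) {
        /* one contiguous chunk: bvec.bv_page + bvec.bv_offset, bvec.bv_len bytes */
}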


 

    5.9. Multi-queue structures
      5.9.1. blk_mq_ctx

Represents a software staging queue.

struct blk_mq_ctx {

        struct {

                spinlock_t              lock;

                struct list_head        rq_list;

        }  ____cacheline_aligned_in_smp;

       

        unsigned int            cpu;

        unsigned int            index_hw;

       

        /* incremented at dispatch time */

        unsigned long           rq_dispatched[2];

        unsigned long           rq_merged;

       

        /* incremented at completion time */   

        unsigned long           ____cacheline_aligned_in_smp rq_completed[2];

       

        struct request_queue    *queue;

        struct kobject          kobj;

} ____cacheline_aligned_in_smp;

 

      5.9.2. blk_mq_hw_ctx

The multi-queue hardware queue context. Its mapping to the blk_mq_ctx software queues is established by map_queues in blk_mq_ops, and the mapping is also stored in the request_queue's mq_map.

/**

 * struct blk_mq_hw_ctx - State for a hardware queue facing the hardware block device

 */

struct blk_mq_hw_ctx {

        struct {

                spinlock_t              lock;

                struct list_head        dispatch;

                unsigned long           state;          /* BLK_MQ_S_* flags */

        } ____cacheline_aligned_in_smp;

 

        struct delayed_work     run_work;

        cpumask_var_t           cpumask;

        int                     next_cpu;

        int                     next_cpu_batch;

                       

        unsigned long           flags;          /* BLK_MQ_F_* flags */

               

        void                    *sched_data;

        struct request_queue    *queue;

        struct blk_flush_queue  *fq;

 

        void                    *driver_data;                                                                                               

 

        struct sbitmap          ctx_map;                                                                                                    

 

        struct blk_mq_ctx       *dispatch_from;

 

        struct blk_mq_ctx       **ctxs;

        unsigned int            nr_ctx;

       

        wait_queue_entry_t      dispatch_wait;

        atomic_t                wait_index;

 

        struct blk_mq_tags      *tags;

        struct blk_mq_tags      *sched_tags;

 

        unsigned long           queued;

        unsigned long           run;

#define BLK_MQ_MAX_DISPATCH_ORDER       7

        unsigned long           dispatched[BLK_MQ_MAX_DISPATCH_ORDER];

 

        unsigned int            numa_node;

        unsigned int            queue_num;

 

        atomic_t                nr_active;

        unsigned int            nr_expired;

 

        struct hlist_node       cpuhp_dead;

        struct kobject          kobj;

 

        unsigned long           poll_considered;

        unsigned long           poll_invoked;

        unsigned long           poll_success;

 

#ifdef CONFIG_BLK_DEBUG_FS

        struct dentry           *debugfs_dir;

        struct dentry           *sched_debugfs_dir;

#endif

 

        /* Must be the last member - see also blk_mq_hw_ctx_size(). */

        struct srcu_struct      srcu[0];

};

 

The related blk_mq_tag_set collects everything the driver shares across its queues — the blk_mq_ops, the number of hardware queues, the queue depth and the tags:

struct blk_mq_tag_set {

        unsigned int            *mq_map;

        const struct blk_mq_ops *ops;

        unsigned int            nr_hw_queues;

        unsigned int            queue_depth;    /* max hw supported */

        unsigned int            reserved_tags;

        unsigned int            cmd_size;       /* per-request extra data */

        int                     numa_node;

        unsigned int            timeout;

        unsigned int            flags;          /* BLK_MQ_F_* */

        void                    *driver_data;

 

        struct blk_mq_tags      **tags;

 

        struct mutex            tag_list_lock;

        struct list_head        tag_list;

};

 

    5.10. Function-operation structures
      5.10.1. elevator_ops

The (single-queue) scheduler operations.

struct elevator_ops

{

        elevator_merge_fn *elevator_merge_fn;

        elevator_merged_fn *elevator_merged_fn;

        elevator_merge_req_fn *elevator_merge_req_fn;

        elevator_allow_bio_merge_fn *elevator_allow_bio_merge_fn;

        elevator_allow_rq_merge_fn *elevator_allow_rq_merge_fn;

        elevator_bio_merged_fn *elevator_bio_merged_fn;

 

        elevator_dispatch_fn *elevator_dispatch_fn;

        elevator_add_req_fn *elevator_add_req_fn;

        elevator_activate_req_fn *elevator_activate_req_fn;

        elevator_deactivate_req_fn *elevator_deactivate_req_fn;

 

        elevator_completed_req_fn *elevator_completed_req_fn;

 

        elevator_request_list_fn *elevator_former_req_fn;

        elevator_request_list_fn *elevator_latter_req_fn;

 

        elevator_init_icq_fn *elevator_init_icq_fn;     /* see iocontext.h */

        elevator_exit_icq_fn *elevator_exit_icq_fn;     /* ditto */

 

        elevator_set_req_fn *elevator_set_req_fn;

        elevator_put_req_fn *elevator_put_req_fn;

 

        elevator_may_queue_fn *elevator_may_queue_fn;

 

        elevator_init_fn *elevator_init_fn;

        elevator_exit_fn *elevator_exit_fn;

        elevator_registered_fn *elevator_registered_fn;

};

 

 

 

      5.10.2. elevator_mq_ops

The multi-queue scheduler operations.

struct elevator_mq_ops {

        int (*init_sched)(struct request_queue *, struct elevator_type *);

        void (*exit_sched)(struct elevator_queue *);

        int (*init_hctx)(struct blk_mq_hw_ctx *, unsigned int);

        void (*exit_hctx)(struct blk_mq_hw_ctx *, unsigned int);

 

        bool (*allow_merge)(struct request_queue *, struct request *, struct bio *);

        bool (*bio_merge)(struct blk_mq_hw_ctx *, struct bio *);

        int (*request_merge)(struct request_queue *q, struct request **, struct bio *);

        void (*request_merged)(struct request_queue *, struct request *, enum elv_merge);

        void (*requests_merged)(struct request_queue *, struct request *, struct request *);

        void (*limit_depth)(unsigned int, struct blk_mq_alloc_data *);

        void (*prepare_request)(struct request *, struct bio *bio);

        void (*finish_request)(struct request *);

        void (*insert_requests)(struct blk_mq_hw_ctx *, struct list_head *, bool);

        struct request *(*dispatch_request)(struct blk_mq_hw_ctx *);

        bool (*has_work)(struct blk_mq_hw_ctx *);

        void (*completed_request)(struct request *);

        void (*started_request)(struct request *);

        void (*requeue_request)(struct request *);

        struct request *(*former_request)(struct request_queue *, struct request *);

        struct request *(*next_request)(struct request_queue *, struct request *);

        void (*init_icq)(struct io_cq *);

        void (*exit_icq)(struct io_cq *);

};

      5.10.3. Multi-queue ops
        5.10.3.1. blk_mq_ops

The multi-queue operations structure: the bridge between the multi-queue block layer and the block device driver — extremely important.

struct blk_mq_ops {

        /*     

         * Queue request

         */    

        queue_rq_fn             *queue_rq;      /* push one request into the driver */

               

        /*

         * Reserve budget before queue request, once .queue_rq is

         * run, it is driver's responsibility to release the

         * reserved budget. Also we have to handle failure case

         * of .get_budget for avoiding I/O deadlock.

         */

        get_budget_fn           *get_budget;

        put_budget_fn           *put_budget;

       

        /*

         * Called on request timeout

         */

        timeout_fn              *timeout;

       

        /*     

         * Called to poll for completion of a specific tag.

         */

        poll_fn                 *poll;

 

        softirq_done_fn         *complete;

 

        /*

         * Called when the block layer side of a hardware queue has been

         * set up, allowing the driver to allocate/init matching structures.

         * Ditto for exit/teardown.

         */

        init_hctx_fn            *init_hctx;

        exit_hctx_fn            *exit_hctx;

 


 

        /*

         * Called for every command allocated by the block layer to allow

         * the driver to set up driver specific data.

         *

         * Tag greater than or equal to queue_depth is for setting up

         * flush request.

         *

         * Ditto for exit/teardown.

         */

        init_request_fn         *init_request;

        exit_request_fn         *exit_request;

        /* Called from inside blk_get_request() */

        void (*initialize_rq_fn)(struct request *rq);

 

        map_queues_fn           *map_queues;    /* mapping between blk_mq_ctx and blk_mq_hw_ctx */

 

#ifdef CONFIG_BLK_DEBUG_FS

        /*

         * Used by the debugfs implementation to show driver-specific

         * information about a request.

         */

        void (*show_rq)(struct seq_file *m, struct request *rq);

#endif

};

For example, the SCSI multi-queue operations:

          5.10.3.1.1. scsi_mq_ops

The multi-queue operations used by the modern SCSI driver; the old single-queue handler was scsi_request_fn.

static const struct blk_mq_ops scsi_mq_ops = {

        .get_budget     = scsi_mq_get_budget,

        .put_budget     = scsi_mq_put_budget,

        .queue_rq       = scsi_queue_rq,

        .complete       = scsi_softirq_done,

        .timeout        = scsi_timeout,

#ifdef CONFIG_BLK_DEBUG_FS

        .show_rq        = scsi_show_rq,

#endif

        .init_request   = scsi_mq_init_request,

        .exit_request   = scsi_mq_exit_request,

        .initialize_rq_fn = scsi_initialize_rq,

        .map_queues     = scsi_map_queues,

};

  6. Functions: the main functions
    6.1. Multi-queue
      6.1.1. blk_mq_flush_plug_list

Flushes the plugged requests in multi-queue; it is called from blk_flush_plug_list.
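From a submitter's point of view, plugging looks like this (API as of 4.17): requests issued inside the window are batched per process and flushed through blk_flush_plug_list / blk_mq_flush_plug_list when the plug is finished (or when the process schedules out):

struct blk_plug plug;

blk_start_plug(&plug);
/* ... submit a batch of bios, e.g. via submit_bio() ... */
blk_finish_plug(&plug);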

      6.1.2. blk_mq_make_request: the multi-queue entry point

This function is the multi-queue counterpart of single-queue's blk_queue_bio — the multi-queue entry point. Its logic lays out the whole multi-queue I/O processing flow.

The overall logic closely mirrors the single-queue blk_queue_bio.

If the bio can be merged into the process's plug list, it is merged and we return. Otherwise blk_mq_sched_bio_merge tries to merge it into the request queue. In either place, the merge can be a front merge or a back merge.

If the bio cannot be merged anywhere, a request is generated for it. The new request then follows one of several branches, chosen on conditions such as flush, sync and plug, but every branch ultimately calls blk_mq_run_hw_queue to issue the I/O to the device.

static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)

{      

        const int is_sync = op_is_sync(bio->bi_opf);

        const int is_flush_fua = op_is_flush(bio->bi_opf);

        struct blk_mq_alloc_data data = { .flags = 0 };

        struct request *rq;

        unsigned int request_count = 0;

        struct blk_plug *plug;

        struct request *same_queue_rq = NULL;

        blk_qc_t cookie;

        unsigned int wb_acct;

       

        blk_queue_bounce(q, &bio);

       

        blk_queue_split(q, &bio);       /* split the bio to fit the device's hardware limits */

       

        if (!bio_integrity_prep(bio))

                return BLK_QC_T_NONE;

               

        if (!is_flush_fua && !blk_queue_nomerges(q) &&

            blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq)) /* try to merge into the process plug list */

                return BLK_QC_T_NONE;

 

        if (blk_mq_sched_bio_merge(q, bio))     /* merge into the request queue; done on success */

                return BLK_QC_T_NONE;

 

        wb_acct = wbt_wait(q->rq_wb, bio, NULL);

 

        trace_block_getrq(q, bio, bio->bi_opf);

 

        rq = blk_mq_get_request(q, bio, bio->bi_opf, &data);    /* no merge possible: allocate a new request */

        if (unlikely(!rq)) {

                __wbt_done(q->rq_wb, wb_acct);

                if (bio->bi_opf & REQ_NOWAIT)

                        bio_wouldblock_error(bio);

                return BLK_QC_T_NONE;

        }

        wbt_track(&rq->issue_stat, wb_acct);

 

        cookie = request_to_qc_t(data.hctx, rq);

 

        plug = current->plug;

        if (unlikely(is_flush_fua)) {   /* flush/FUA request */

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio); /* fill the request from the bio, then push it down to the hardware queue */

 

                /* bypass scheduler for flush rq */

                blk_insert_flush(rq);

                blk_mq_run_hw_queue(data.hctx, true);   /* issue the I/O to the device */

        } else if (plug && q->nr_hw_queues == 1) {      /* plugging allowed and exactly one hardware queue */

                struct request *last = NULL;

 

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio);

 

                /*

                 * @request_count may become stale because of schedule

                 * out, so check the list again.

                 */

                if (list_empty(&plug->mq_list))

                        request_count = 0;

                else if (blk_queue_nomerges(q))

                        request_count = blk_plug_queued_count(q);

 

                if (!request_count)

                        trace_block_plug(q);

                else

                        last = list_entry_rq(plug->mq_list.prev);

 

                if (request_count >= BLK_MAX_REQUEST_COUNT || (last &&

                    blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE)) {

                        blk_flush_plug_list(plug, false);

                        trace_block_plug(q);

                }

 

                list_add_tail(&rq->queuelist, &plug->mq_list);

        } else if (plug && !blk_queue_nomerges(q)) {

                blk_mq_bio_to_request(rq, bio);

 

                /*

                 * We do limited plugging. If the bio can be merged, do that.

                 * Otherwise the existing request in the plug list will be

                 * issued. So the plug list will have one request at most

                 * The plug list might get flushed before this. If that happens,

                 * the plug list is empty, and same_queue_rq is invalid.

                 */

                if (list_empty(&plug->mq_list))

                        same_queue_rq = NULL;

                if (same_queue_rq)

                        list_del_init(&same_queue_rq->queuelist);

                list_add_tail(&rq->queuelist, &plug->mq_list);

 

                blk_mq_put_ctx(data.ctx);

 

                if (same_queue_rq) {

                        data.hctx = blk_mq_map_queue(q,

                                        same_queue_rq->mq_ctx->cpu);

                        blk_mq_try_issue_directly(data.hctx, same_queue_rq,

                                        &cookie);

                }

        } else if (q->nr_hw_queues > 1 && is_sync) {

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio);

                blk_mq_try_issue_directly(data.hctx, rq, &cookie);

        } else if (q->elevator) {

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio);

                blk_mq_sched_insert_request(rq, false, true, true);

        } else {

                blk_mq_put_ctx(data.ctx);

                blk_mq_bio_to_request(rq, bio);

                blk_mq_queue_io(data.hctx, data.ctx, rq);

                blk_mq_run_hw_queue(data.hctx, true);

        }

 

        return cookie;

}

    6.2. Single queue
      6.2.1. blk_flush_plug_list

The counterpart of multi-queue's blk_mq_flush_plug_list: it flushes the bios on the process's plug list into the scheduler queue via __elv_add_request, then calls __blk_run_queue to issue the I/O.

      6.2.2. blk_queue_bio: the single-queue entry point

This is the single-queue request-handling function, responsible for putting bios onto the queue; it is called by generic_make_request. If multi-queue one day fully replaces single-queue, this function will become history.

A bio submitted to this function is handled in one of the following ways:

  1. The process's I/O is plugged: try to merge the bio into the process's plugged list, i.e. current->plug.list.
  2. The process's I/O is unplugged: use the I/O scheduler's code to find a suitable request and merge the bio into it.
  3. If the bio cannot be merged into any existing request, fall through to allocating a fresh request just for this bio.

Whether merging into the plugged list or through the I/O scheduler, there are two merge directions:

back merges are done by bio_attempt_back_merge;

front merges are done by bio_attempt_front_merge.

static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)

{      

        struct blk_plug *plug;  /* per-process plug state */

        int where = ELEVATOR_INSERT_SORT;

        struct request *req, *free;

        unsigned int request_count = 0;

        unsigned int wb_acct;

 

        /*

         * low level driver can indicate that it wants pages above a

         * certain limit bounced to low memory (ie for highmem, or even

         * ISA dma in theory)

         */

        blk_queue_bounce(q, &bio);

                       

        blk_queue_split(q, &bio);       /* split the bio per the queue's limits.max_sectors / limits.max_segments (defaults set in blk_set_default_limits) */

              

        if (!bio_integrity_prep(bio))   /* integrity preparation */

                return BLK_QC_T_NONE;

 

        if (op_is_flush(bio->bi_opf)) { /* REQ_PREFLUSH / REQ_FUA need special handling */

                spin_lock_irq(q->queue_lock);

                where = ELEVATOR_INSERT_FLUSH;

                goto get_rq;

        }

 

        /*

         * Check if we can merge with the plugged list before grabbing

         * any locks.

         */

        if (!blk_queue_nomerges(q)) {   /* may this queue merge at all? (QUEUE_FLAG_NOMERGES) */

                if (blk_attempt_plug_merge(q, bio, &request_count, NULL))       /* merge into the process plug list and return; the plugged I/O is handled later */

                        return BLK_QC_T_NONE;

        } else

                request_count = blk_plug_queued_count(q);       /* just count the plugged requests */

 

        spin_lock_irq(q->queue_lock);

 

        switch (elv_merge(q, &req, bio)) {      /* single-queue I/O scheduling: ask the elevator about merging */

        case ELEVATOR_BACK_MERGE:       /* merge the bio into the back of an existing request; ends with blk_account_io_start */

                if (!bio_attempt_back_merge(q, req, bio))       /* perform the back merge */

                        break;

                elv_bio_merged(q, req, bio);

                free = attempt_back_merge(q, req);

                if (free)

                        __blk_put_request(q, free);

                else

                        elv_merged_request(q, req, ELEVATOR_BACK_MERGE);

                goto out_unlock;

        case ELEVATOR_FRONT_MERGE:      /* merge the bio into the front of an existing request; ends with blk_account_io_start */

                if (!bio_attempt_front_merge(q, req, bio))

                        break;

                elv_bio_merged(q, req, bio);

                free = attempt_front_merge(q, req);

                if (free)

                        __blk_put_request(q, free);

                else

                        elv_merged_request(q, req, ELEVATOR_FRONT_MERGE);

                goto out_unlock;

        default:

                break;

        }

 

get_rq:

        wb_acct = wbt_wait(q->rq_wb, bio, q->queue_lock);

 

        /*

         * Grab a free request. This is might sleep but can not fail.

         * Returns with the queue unlocked.

         */

        blk_queue_enter_live(q);

        req = get_request(q, bio->bi_opf, bio, 0);      /* no merge possible in plug list or request queue: allocate a fresh request */

        if (IS_ERR(req)) {

                blk_queue_exit(q);

                __wbt_done(q->rq_wb, wb_acct);

                if (PTR_ERR(req) == -ENOMEM)

                        bio->bi_status = BLK_STS_RESOURCE;

                else

                        bio->bi_status = BLK_STS_IOERR;

                bio_endio(bio);

                goto out_unlock;

        }

        wbt_track(&req->issue_stat, wb_acct);

 

        /*

         * After dropping the lock and possibly sleeping here, our request

         * may now be mergeable after it had proven unmergeable (above).

         * We don't worry about that case for efficiency. It won't happen

         * often, and the elevators are able to handle it.

         */

        blk_init_request_from_bio(req, bio);    /* initialize the request from the bio */

 

        if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags))

                req->cpu = raw_smp_processor_id();

 

        plug = current->plug;

        if (plug) {

                /*

                 * If this is the first request added after a plug, fire

                 * of a plug trace.

                 *

                 * @request_count may become stale because of schedule

                 * out, so check plug list again.

                 */

                if (!request_count || list_empty(&plug->list))

                        trace_block_plug(q);

                else {

                        struct request *last = list_entry_rq(plug->list.prev);

                        if (request_count >= BLK_MAX_REQUEST_COUNT ||

                            blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE) {

                                blk_flush_plug_list(plug, false);       /* plugged I/O exceeds count/size limits: flush it (false = not triggered from the scheduler); inserts via __elv_add_request */

                                trace_block_plug(q);

                        }

                }

                list_add_tail(&req->queuelist, &plug->list);    /* hang the request on the plug list */

                blk_account_io_start(req, true);        /* start per-queue I/O accounting */

        } else {

                spin_lock_irq(q->queue_lock);

                add_acct_request(q, req, where);        /* calls blk_account_io_start and __elv_add_request: queue the request for processing */

                __blk_run_queue(q);     /* not plugged: kick the queue and get to work */

out_unlock:

                spin_unlock_irq(q->queue_lock);

        }

 

        return BLK_QC_T_NONE;

}

 

    6.3. Initialization functions
      6.3.1. blk_mq_init_queue

Initializes the software staging queues and the hardware dispatch queues, and sets up the mapping between them.

It also installs blk_mq_make_request by calling blk_queue_make_request.

 

struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)

{

        struct request_queue *uninit_q, *q;

 

        uninit_q = blk_alloc_queue_node(GFP_KERNEL, set->numa_node, NULL);

        if (!uninit_q)

                return ERR_PTR(-ENOMEM);

       

        q = blk_mq_init_allocated_queue(set, uninit_q);

        if (IS_ERR(q))

                blk_cleanup_queue(uninit_q);

       

        return q;

}

 

      6.3.2. blk_mq_init_request

This function invokes the driver's .init_request callback.

static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,

                               unsigned int hctx_idx, int node)

{

        int ret;

 

        if (set->ops->init_request) {

                ret = set->ops->init_request(set, rq, hctx_idx, node);

                if (ret)

                        return ret;

        }

 

        seqcount_init(&rq->gstate_seq);

        u64_stats_init(&rq->aborted_gstate_sync);

        /*

         * start gstate with gen 1 instead of 0, otherwise it will be equal

         * to aborted_gstate, and be identified timed out by

         * blk_mq_terminate_expired.

         */

        WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);

 

        return 0;

}

 

      6.3.3. blk_init_queue

The (legacy) queue-initialization function; it calls blk_init_queue_node.

struct request_queue *blk_init_queue(request_fn_proc *rfn, spinlock_t *lock)

{

        return blk_init_queue_node(rfn, lock, NUMA_NO_NODE);

}

blk_init_queue calls blk_init_queue_node, which in turn calls blk_init_allocated_queue.

struct request_queue *

blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)

{

        struct request_queue *q;

        q = blk_alloc_queue_node(GFP_KERNEL, node_id, lock);

        if (!q)

                return NULL;

               

        q->request_fn = rfn;

        if (blk_init_allocated_queue(q) < 0) {

                blk_cleanup_queue(q);

                return NULL;

        }

        return q;

}

 

      6.3.4. blk_queue_make_request

blk_queue_make_request installs a queue's entry function; for multi-queue that is blk_mq_make_request.

 

    6.4. The bridging function
      6.4.1. generic_make_request

This function bridges the layer above and the layer below, which is why its definition is accompanied by extensive descriptive comments to help developers understand it.

generic_make_request is the entrance of the bio layer: it hands bios to the block layer, delivering bio structures toward a request queue. With single-queue it calls blk_queue_bio; with multi-queue, blk_mq_make_request.

/**

 * generic_make_request - hand a buffer to its device driver for I/O

 * @bio:  The bio describing the location in memory and on the device.

 *

 * generic_make_request() is used to make I/O requests of block

 * devices. It is passed a &struct bio, which describes the I/O that needs

 * to be done.

 *

 * generic_make_request() does not return any status.  The

 * success/failure status of the request, along with notification of

 * completion, is delivered asynchronously through the bio->bi_end_io

 * function described (one day) else where.

 *

 * The caller of generic_make_request must make sure that bi_io_vec

 * are set to describe the memory buffer, and that bi_dev and bi_sector are

 * set to describe the device address, and the

 * bi_end_io and optionally bi_private are set to describe how

 * completion notification should be signaled.

 *

 * generic_make_request and the drivers it calls may use bi_next if this

 * bio happens to be merged with someone else, and may resubmit the bio to

 * a lower device by calling into generic_make_request recursively, which

 * means the bio should NOT be touched after the call to ->make_request_fn.

 */

blk_qc_t generic_make_request(struct bio *bio)

{

        /*

         * bio_list_on_stack[0] contains bios submitted by the current

         * make_request_fn.

         * bio_list_on_stack[1] contains bios that were submitted before

         * the current make_request_fn, but that haven't been processed

         * yet.

         */

        struct bio_list bio_list_on_stack[2];

        blk_mq_req_flags_t flags = 0;

        struct request_queue *q = bio->bi_disk->queue;  /* queue of the device this bio targets */

        blk_qc_t ret = BLK_QC_T_NONE;

 

        if (bio->bi_opf & REQ_NOWAIT)   /* propagate REQ_NOWAIT into the queue-enter flags */

                flags = BLK_MQ_REQ_NOWAIT;

        if (blk_queue_enter(q, flags) < 0) {    /* can the queue take new requests right now? */

                if (!blk_queue_dying(q) && (bio->bi_opf & REQ_NOWAIT))

                        bio_wouldblock_error(bio);

                else

                        bio_io_error(bio);

                return ret;

        }

 

        if (!generic_make_request_checks(bio))  /* sanity-check the bio */

                goto out;

 

        /*

         * We only want one ->make_request_fn to be active at a time, else

         * stack usage with stacked devices could be a problem.  So use

         * current->bio_list to keep a list of requests submited by a

         * make_request_fn function.  current->bio_list is also used as a

         * flag to say if generic_make_request is currently active in this

         * task or not.  If it is NULL, then no make_request is active.  If

         * it is non-NULL, then a make_request is active, and new requests

         * should be added at the tail

         */

        if (current->bio_list) {        /* a make_request_fn is already active in this task (stacked devices such as MD): just queue the bio and return */

                bio_list_add(&current->bio_list[0], bio);

                goto out;

        }

 

        /* following loop may be a bit non-obvious, and so deserves some

         * explanation.

         * Before entering the loop, bio->bi_next is NULL (as all callers

         * ensure that) so we have a list with a single bio.

         * We pretend that we have just taken it off a longer list, so

         * we assign bio_list to a pointer to the bio_list_on_stack,

         * thus initialising the bio_list of new bios to be

         * added.  ->make_request() may indeed add some more bios

         * through a recursive call to generic_make_request.  If it

         * did, we find a non-NULL value in bio_list and re-enter the loop

         * from the top.  In this case we really did just take the bio

         * of the top of the list (no pretending) and so remove it from

         * bio_list, and call into ->make_request() again.

         */

        BUG_ON(bio->bi_next);

        bio_list_init(&bio_list_on_stack[0]);   /* bios to be submitted by this invocation */

        current->bio_list = bio_list_on_stack;  /* mark generic_make_request active; reset to NULL at the end */

        do {    /* process bios in a loop, calling make_request_fn on each */

                bool enter_succeeded = true;

 

                if (unlikely(q != bio->bi_disk->queue)) {       /* does this bio target a different queue than the previous one? */

                        if (q)

                                blk_queue_exit(q);      /* drop the queue reference; inverse of blk_queue_enter */

                        q = bio->bi_disk->queue;        /* the next bio's queue */

                        flags = 0;

                        if (bio->bi_opf & REQ_NOWAIT)

                                flags = BLK_MQ_REQ_NOWAIT;

                        if (blk_queue_enter(q, flags) < 0) {

                                enter_succeeded = false;

                                q = NULL;

                        }

                }

 

                if (enter_succeeded) {  /* queue entered successfully */

                        struct bio_list lower, same;

 

                        /* Create a fresh bio_list for all subordinate requests */

                        bio_list_on_stack[1] = bio_list_on_stack[0];    /* bios submitted by the previous make_request_fn */

                        bio_list_init(&bio_list_on_stack[0]);           /* fresh list for bios this call will submit */

                        ret = q->make_request_fn(q, bio);               /* the key call: ->make_request_fn */

 

                        /* sort new bios into those for a lower level

                         * and those for the same level

                         */

                        bio_list_init(&lower);  /* two lists: bios for a lower level vs the same level */

                        bio_list_init(&same);

                        while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL)     /* walk the bios submitted by this call */

                                if (q == bio->bi_disk->queue)

                                        bio_list_add(&same, bio);

                                else

                                        bio_list_add(&lower, bio);

                        /* now assemble so we handle the lowest level first */

                        bio_list_merge(&bio_list_on_stack[0], &lower);  /* reassemble: lowest level first */

                        bio_list_merge(&bio_list_on_stack[0], &same);

                        bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);

                } else {

                        if (unlikely(!blk_queue_dying(q) &&

                                        (bio->bi_opf & REQ_NOWAIT)))

                                bio_wouldblock_error(bio);

                        else

                                bio_io_error(bio);

                }

                bio = bio_list_pop(&bio_list_on_stack[0]);      /* next bio, keep going */

        } while (bio);

        current->bio_list = NULL; /* deactivate */

 

out:

        if (q)

                blk_queue_exit(q);

        return ret;

}

  7. References

An article that gives no reference links is just trolling — so here they are.

https://lwn.net/Articles/736534/

https://lwn.net/Articles/738449/

https://www.thomas-krenn.com/en/wiki/Linux_Multi-Queue_Block_IO_Queueing_Mechanism_(blk-mq)

https://miuv.blog/2017/10/21/linux-block-mq-simple-walkthrough/

https://hyunyoung2.github.io/2016/09/14/Multi_Queue/

http://ari-ava.blogspot.com/2014/07/opw-linux-block-io-layer-part-4-multi.html

Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems

The multiqueue block layer

Blk-mq: new multi-queue block IO queueing mechanism

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 
