Linux ixgbe 10G Intel NIC packet processing flow


maijian

Published 2013-06-06 16:18:59


This article studies the ixgbe NIC driver as found in Linux kernel 3.9.4.

Background:

ixgbe_adapter

/* board specific private data structure */
struct ixgbe_adapter {

// The structure is large; only the members that matter here are excerpted.

// TX rings

struct ixgbe_ring *tx_ring[MAX_TX_QUEUES] ____cacheline_aligned_in_smp;

// RX rings

struct ixgbe_ring *rx_ring[MAX_RX_QUEUES];

// Each q_vector embeds a napi_struct. A q_vector appears to pair one-to-one

// with an entry in msix_entries below, i.e. one q_vector per interrupt vector.

struct ixgbe_q_vector *q_vector[MAX_Q_VECTORS];

// MSI-X table, presumably one entry per interrupt vector

struct msix_entry *msix_entries;

};

ixgbe_q_vector

struct ixgbe_q_vector {
        struct ixgbe_adapter *adapter;
#ifdef CONFIG_IXGBE_DCA
        int cpu;            /* CPU for DCA */
#endif
        u16 v_idx;              /* index of q_vector within array, also used for
                                 * finding the bit in EICR and friends that
                                 * represents the vector for this ring */
        u16 itr;                /* Interrupt throttle rate written to EITR */
        struct ixgbe_ring_container rx, tx;

        struct napi_struct napi; /* its poll callback is ixgbe_poll */
        cpumask_t affinity_mask;
        int numa_node;
        struct rcu_head rcu;    /* to avoid race with update stats on free */
        char name[IFNAMSIZ + 9];

        /* for dynamic allocation of rings associated with this q_vector */
        struct ixgbe_ring ring[0] ____cacheline_internodealigned_in_smp;
};
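
Because napi is embedded in ixgbe_q_vector rather than referenced through a pointer, a poll callback that only receives the napi_struct pointer can recover the owning q_vector with container_of. The following is a minimal userspace sketch of that idiom. The structures are cut down to two fields, and napi_to_q_vector is a hypothetical helper name, not a function from the driver.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; only the fields needed here. */
struct napi_struct {
    int weight;
};

struct ixgbe_q_vector {
    unsigned short v_idx;      /* index of this vector */
    struct napi_struct napi;   /* embedded member, not a pointer */
};

/* Userspace equivalent of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical helper: map a napi pointer back to its owning q_vector. */
static struct ixgbe_q_vector *napi_to_q_vector(struct napi_struct *napi)
{
    assert(napi != NULL);
    return container_of(napi, struct ixgbe_q_vector, napi);
}
```

Calling napi_to_q_vector(&qv.napi) on any struct ixgbe_q_vector qv yields &qv again, which is how a single poll function can serve every vector.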

softnet_data

/*
* Incoming packets are placed on per-cpu queues
*/
struct softnet_data {
    struct Qdisc        *output_queue;
    struct Qdisc        **output_queue_tailp;
    struct list_head    poll_list;
    struct sk_buff      *completion_queue;
    struct sk_buff_head process_queue;

    /* stats */
    unsigned int        processed;
    unsigned int        time_squeeze;
    unsigned int        cpu_collision;
    unsigned int        received_rps;

#ifdef CONFIG_RPS
    struct softnet_data *rps_ipi_list;

    /* Elements below can be accessed between CPUs for RPS */
    struct call_single_data csd ____cacheline_aligned_in_smp;
    struct softnet_data *rps_ipi_next;
    unsigned int        cpu;
    unsigned int        input_queue_head;
    unsigned int        input_queue_tail;
#endif
    unsigned int        dropped;
    struct sk_buff_head input_pkt_queue;
    struct napi_struct backlog; /* per-CPU backlog napi; its poll callback is process_backlog */
};

napi_struct

/*
* Structure for NAPI scheduling similar to tasklet but with weighting
*/
struct napi_struct {
    /* The poll_list must only be managed by the entity which
     * changes the state of the NAPI_STATE_SCHED bit. This means
     * whoever atomically sets that bit can add this napi_struct
     * to the per-cpu poll_list, and whoever clears that bit
     * can remove from the list right before clearing the bit.
     */
    struct list_head    poll_list;

    unsigned long       state;
    int         weight;
    unsigned int        gro_count;
    int         (*poll)(struct napi_struct *, int); /* the poll callback */
#ifdef CONFIG_NETPOLL
    spinlock_t      poll_lock;
    int         poll_owner;
#endif
    struct net_device   *dev;
    struct sk_buff      *gro_list;
    struct sk_buff      *skb;
    struct list_head    dev_list;
};

enum {
    NAPI_STATE_SCHED,   /* Poll is scheduled */
    NAPI_STATE_DISABLE, /* Disable pending */
    NAPI_STATE_NPSVC,   /* Netpoll - don't dequeue from poll_list */
};
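
The comment on poll_list says that whoever atomically sets NAPI_STATE_SCHED owns the right to queue the napi. A rough userspace model of that rule, using C11 atomics in place of the kernel's bit operations, might look like this. The names napi_model, model_schedule_prep and model_complete are made up for the sketch, and the check for a pending disable that the real napi_schedule_prep also performs is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { NAPI_STATE_SCHED, NAPI_STATE_DISABLE, NAPI_STATE_NPSVC };

struct napi_model {
    atomic_ulong state;        /* bit field indexed by the enum above */
};

/* Returns true only for the caller that flips SCHED from 0 to 1. */
static bool model_schedule_prep(struct napi_model *n)
{
    unsigned long bit = 1UL << NAPI_STATE_SCHED;
    unsigned long old = atomic_fetch_or(&n->state, bit);

    return (old & bit) == 0;
}

/* Clears SCHED so that the next interrupt may schedule the napi again. */
static void model_complete(struct napi_model *n)
{
    atomic_fetch_and(&n->state, ~(1UL << NAPI_STATE_SCHED));
}
```

A second model_schedule_prep call before model_complete returns false, so an interrupt that fires while the napi is already queued does not queue it twice.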

1.......................................................

File: ixgbe_main.c
ixgbe_init_module registers the PCI driver structure ixgbe_driver. The member of interest is its ixgbe_probe callback, which runs when the NIC is probed.

2.......................................................

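File: ixgbe_main.c
ixgbe_probe does three things of interest here:

1. It creates the ixgbe_adapter structure. This is the per-NIC instance that holds all of the device's data and interfaces, including a reference to the net_device created below and the per-interrupt data structures (the NIC supports MSI-X). Its q_vector array holds ixgbe_q_vector pointers, one per interrupt vector. Among other things, each ixgbe_q_vector embeds a napi_struct member named napi whose poll callback is ixgbe_poll, the polling function invoked from the softirq. The adapter also has the member struct msix_entry *msix_entries, which holds the interrupt vector for each MSI-X entry; step 4 shows how it is used.

2. It creates a net_device, the structure that represents a network device in Linux, and points its netdev_ops at ixgbe_netdev_ops. That ops table supplies ixgbe_open as the open callback, which is documented as "Called when a network interface is made active". In other words, open runs when the interface is brought up, whereas probe runs when the device is detected.

3. It calls ixgbe_init_interrupt_scheme, which among other things sets the poll callback of each q_vector's napi_struct to ixgbe_poll. This is the device poll function that the kernel core calls. Keep the two kinds of napi apart: the napi_struct in the per-CPU softnet_data (backlog) polls with process_backlog, while the napi_struct in an ixgbe_q_vector polls with ixgbe_poll. The difference shows up in step 9, where the receive path depends on whether NAPI is in use.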

3.......................................................

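File: ixgbe_main.c
ixgbe_open has two main tasks:

1. Set up the ring buffers for each queue

ixgbe_setup_all_tx_resources(adapter); // set up the TX rings and related resources
ixgbe_setup_all_rx_resources(adapter); // set up the RX rings and related resources

2. Call ixgbe_request_irq, which installs the interrupt handlers according to the interrupt mode in use (MSI-X, MSI or other). Most decent NICs support MSI-X, so we follow the implementation of ixgbe_request_msix_irqs.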

4.......................................................

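File: ixgbe_main.c
ixgbe_request_msix_irqs: with MSI-X each queue vector gets its own interrupt number, so this function installs a handler for every one of them (these are the adapter's q_vector entries described in point 1 of step 2).

The handler is ixgbe_msix_clean_rings: request_irq is called once per vector to register ixgbe_msix_clean_rings as that vector's hard-IRQ handler.

An excerpt of the code:

for (vector = 0; vector < adapter->num_q_vectors; vector++) {

...........

        struct ixgbe_q_vector *q_vector = adapter->q_vector[vector];
        struct msix_entry *entry = &adapter->msix_entries[vector];
        err = request_irq(entry->vector, &ixgbe_msix_clean_rings, 0,
                          q_vector->name, q_vector);

...........

}

This ties each q_vector, as the handler's data argument, to its MSI-X vector one-to-one.

Besides installing the handlers, the function also sets the CPU affinity of each IRQ.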
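
The key point of the request_irq call above is the last argument: the q_vector pointer is stored with the handler and handed back on every interrupt, so one handler function serves all vectors. The sketch below is a toy userspace model of that registration pattern, not kernel code. model_request_irq, model_raise_irq, the fixed-size table and the return values are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IRQS 16

typedef int (*irq_handler_fn)(int irq, void *data);

struct irq_slot {
    irq_handler_fn handler;
    void *data;                /* cookie passed back to the handler */
};

static struct irq_slot irq_table[MAX_IRQS];

/* Toy stand-in for request_irq(): remember the handler and its cookie. */
static int model_request_irq(int irq, irq_handler_fn handler, void *data)
{
    if (irq < 0 || irq >= MAX_IRQS || irq_table[irq].handler != NULL)
        return -1;             /* out of range or already claimed */
    irq_table[irq].handler = handler;
    irq_table[irq].data = data;
    return 0;
}

/* Simulates the hardware raising an interrupt on the given line. */
static int model_raise_irq(int irq)
{
    assert(irq >= 0 && irq < MAX_IRQS && irq_table[irq].handler != NULL);
    return irq_table[irq].handler(irq, irq_table[irq].data);
}

struct q_vector_model {
    int v_idx;
    int scheduled;             /* how many times this vector was scheduled */
};

/* One handler for every vector; the cookie says which q_vector fired. */
static int model_clean_rings(int irq, void *data)
{
    struct q_vector_model *qv = data;

    (void)irq;
    qv->scheduled++;           /* the real handler calls napi_schedule() here */
    return 1;                  /* stands in for IRQ_HANDLED */
}
```

Registering two q_vector_model instances on two different lines and raising one of them increments only the matching instance's counter.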

5.......................................................

File: ixgbe_main.c
ixgbe_msix_clean_rings is the interrupt handler that runs when the NIC has received or sent data. It does very little itself: it just calls napi_schedule.

The napi_schedule call chain takes some patience to trace:

static irqreturn_t ixgbe_msix_clean_rings(int irq, void *data)
{
        struct ixgbe_q_vector *q_vector = data;

        /* Note: the napi passed in belongs to the q_vector,
         * so the poll callback invoked later is ixgbe_poll. */
        if (q_vector->rx.ring || q_vector->tx.ring)
                napi_schedule(&q_vector->napi);

        return IRQ_HANDLED;
}

static inline void napi_schedule(struct napi_struct *n)
{
        if (napi_schedule_prep(n))
                __napi_schedule(n);
}

void __napi_schedule(struct napi_struct *n)
{
        unsigned long flags;

        local_irq_save(flags);
        /* __get_cpu_var(softnet_data) yields the softnet_data of the CPU we
         * are running on. The q_vector's napi is appended to that CPU's
         * poll_list, and NET_RX_SOFTIRQ is then raised so that the CPU
         * processes the napi entries queued in its own softnet_data. */
        ____napi_schedule(&__get_cpu_var(softnet_data), n);
        local_irq_restore(flags);
}

static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
        list_add_tail(&napi->poll_list, &sd->poll_list);
        __raise_softirq_irqoff(NET_RX_SOFTIRQ);
}

Once the softirq runs, the packet leaves the driver layer and enters the kernel's core networking logic. (Mind the number of leading underscores: napi_schedule, __napi_schedule and ____napi_schedule are three different functions.)
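
The two lines of ____napi_schedule can be modelled in userspace to make the data movement visible: the napi is linked at the tail of the per-CPU poll list and a pending bit is set for the receive softirq. This is only a sketch. The list helpers are re-implemented here, softnet_model and napi_model are invented types, and the pending field stands in for the kernel's per-CPU softirq mask.

```c
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert entry just before head, i.e. at the tail of the circular list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/* Bit number used for the receive softirq in this model. */
enum { NET_RX_SOFTIRQ_NR = 3 };

struct napi_model {
    struct list_head poll_list;
    int id;
};

struct softnet_model {
    struct list_head poll_list;   /* napi instances waiting to be polled */
    unsigned int pending;         /* models the pending-softirq bit mask */
};

/* Queue the napi on this CPU's list and mark the RX softirq pending. */
static void model_napi_schedule(struct softnet_model *sd, struct napi_model *n)
{
    list_add_tail(&n->poll_list, &sd->poll_list);
    sd->pending |= 1u << NET_RX_SOFTIRQ_NR;
}
```

Scheduling two napi instances in a row leaves them on the list in arrival order, which is the order in which the softirq handler will later poll them.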

6.......................................................

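File: net/core/dev.c
napi_schedule ends in __raise_softirq_irqoff, which raises the NET_RX_SOFTIRQ softirq. The handler registered for that softirq then carries the packet up into the protocol stack.
NET_RX_SOFTIRQ is the softirq for received packets; its handler is net_rx_action.
NET_TX_SOFTIRQ is the softirq raised after packets have been sent; its handler is net_tx_action.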

7.......................................................

File: net/core/dev.c
net_rx_action fetches the current CPU's softnet_data and walks its poll_list, calling the poll callback of each queued napi_struct so that the packets waiting behind it get processed.

For the NIC, that callback was set up during ixgbe_probe (step 2) and is ixgbe_poll.

net_rx_action calls the poll callback only if the napi is in the NAPI_STATE_SCHED state.

An excerpt of net_rx_action:

static void net_rx_action(struct softirq_action *h)
{
        /* This is the NAPI way of working: the device's hard interrupt stays
         * off while packets are polled. Polling stops once budget packets
         * have been handled or 2 jiffies (timer ticks, not CPU clock cycles)
         * have elapsed. */
        unsigned long time_limit = jiffies + 2;
        int budget = netdev_budget;

        struct softnet_data *sd = &__get_cpu_var(softnet_data);

        local_irq_disable();

        while (!list_empty(&sd->poll_list)) {
                struct napi_struct *n;
                int work, weight;

                ......

                /* If softirq window is exhausted then punt.
                 * Allow this to run for 2 jiffies since which will allow
                 * an average latency of 1.5/HZ.
                 */
                /* Leave the polling loop when the budget is used up or the
                 * 2-jiffy window has passed. Polling in batches like this
                 * means far fewer interrupts per packet. */
                if (unlikely(budget <= 0 || time_after_eq(jiffies, time_limit)))
                        goto softnet_break;
                local_irq_enable();

                ......

                /* take the first napi from this CPU's poll_list */
                n = list_first_entry(&sd->poll_list, struct napi_struct, poll_list);

                if (test_bit(NAPI_STATE_SCHED, &n->state)) {
                        /* Call the napi's poll callback, the one described in
                         * point 3 of step 2. For the NIC this is ixgbe_poll. */
                        work = n->poll(n, weight);
                }

                ......

        }

        ......
}
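
The budget accounting in net_rx_action can be illustrated with a small userspace simulation. Each napi is polled with its own weight, the work it reports is subtracted from the shared budget, and the loop gives up once the budget is exhausted. This is a sketch under simplifying assumptions: the poll list is a plain array visited round-robin, the 2-jiffy time limit is left out, and napi_model, model_poll and model_net_rx_action are invented names.

```c
struct napi_model {
    int pending;   /* packets waiting behind this napi */
    int weight;    /* maximum packets handled per poll call */
};

/* Model poll callback: consume up to quota packets, return the work done. */
static int model_poll(struct napi_model *n, int quota)
{
    int work = n->pending < quota ? n->pending : quota;

    n->pending -= work;
    return work;
}

/*
 * Polls list[0..count) round-robin until every entry is drained or the
 * budget is used up. Returns the number of packets processed.
 */
static int model_net_rx_action(struct napi_model *list, int count, int budget)
{
    int processed = 0;
    int active = count;               /* entries still waiting for a poll */

    while (active > 0) {
        active = 0;
        for (int i = 0; i < count; i++) {
            if (list[i].pending == 0)
                continue;             /* drained: already off the list */
            if (budget <= 0)
                return processed;     /* softirq window exhausted: punt */
            int work = model_poll(&list[i], list[i].weight);
            budget -= work;
            processed += work;
            if (list[i].pending > 0)
                active++;             /* still has packets: stays queued */
        }
    }
    return processed;
}
```

Because the budget is checked before each poll rather than during it, the total can overshoot by up to one weight: with a single napi holding 1000 packets, weight 64 and budget 300, the model handles 320 packets before stopping.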

8.......................................................

File: ixgbe_main.c
ixgbe_poll does three main things:
1. Processes the TX queue with ixgbe_clean_tx_irq
2. Processes the RX queue with ixgbe_clean_rx_irq
3. Finally calls napi_complete
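
The three steps above can be sketched as a userspace model of a poll function. The detail that the budget is split evenly across the RX rings of a vector, and that the function returns the full budget while work remains, reflects my reading of the driver of that era and is an assumption rather than something the article states. TX cleaning is omitted, and all names ending in _model are invented.

```c
#include <assert.h>
#include <stdbool.h>

struct ring_model {
    int pending;               /* packets waiting in this RX ring */
};

struct q_vector_model {
    struct ring_model *rx_rings;
    int rx_count;
    bool irq_enabled;          /* models the per-vector interrupt enable */
};

/* Consume up to budget packets from one ring; return the number consumed. */
static int model_clean_rx(struct ring_model *ring, int budget)
{
    int work = ring->pending < budget ? ring->pending : budget;

    ring->pending -= work;
    return work;
}

/* Returns budget if more polling is needed, 0 once all rings are drained. */
static int model_ixgbe_poll(struct q_vector_model *qv, int budget)
{
    bool clean_complete = true;
    int per_ring_budget = budget;

    assert(budget > 0 && qv->rx_count > 0);
    if (qv->rx_count > 1) {
        per_ring_budget = budget / qv->rx_count;
        if (per_ring_budget < 1)
            per_ring_budget = 1;
    }
    for (int i = 0; i < qv->rx_count; i++) {
        int work = model_clean_rx(&qv->rx_rings[i], per_ring_budget);

        if (work >= per_ring_budget)
            clean_complete = false;   /* ring may still hold packets */
    }
    if (!clean_complete)
        return budget;                /* stay on the poll list */

    qv->irq_enabled = true;           /* napi_complete, then re-arm the IRQ */
    return 0;
}
```

With two rings holding 100 and 5 packets and a budget of 64, the model needs four calls: three that return 64 while the first ring is being drained 32 packets at a time, and a final one that returns 0 and re-enables the interrupt.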

9.......................................................

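File: ixgbe_main.c
ixgbe_clean_rx_irq takes the contents of the ring buffer, turns them into sk_buff packets and hands each one to ixgbe_rx_skb. ixgbe_rx_skb chooses the delivery function depending on whether the adapter is currently in NETPOLL mode. NETPOLL is a mode used for debugging and is not covered further here.

In the normal case it calls napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb). The napi passed in is the q_vector's napi, the one whose poll callback is ixgbe_poll. From there the packet is passed on to netif_receive_skb (step 10).

An excerpt of the code: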

static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector, struct sk_buff *skb)
{
        struct ixgbe_adapter *adapter = q_vector->adapter;

        if (!(adapter->flags & IXGBE_FLAG_IN_NETPOLL))
                napi_gro_receive(&q_vector->napi, skb);
        else
                netif_rx(skb); /* the NETPOLL path */
}
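
To make the "take packets out of the ring buffer" part concrete, here is a userspace sketch of a receive ring walk. The hardware marks a descriptor as done, and the driver consumes done descriptors starting at next_to_clean, wrapping around at the end of the ring. The descriptor layout, the DESC_DONE bit value and the function name are simplifications invented for this sketch and do not match the real ixgbe descriptor format.

```c
#include <stdint.h>

#define RING_SIZE 8
#define DESC_DONE 0x1u   /* model of a "descriptor done" status bit */

struct rx_desc_model {
    uint32_t status;     /* DESC_DONE is set once the NIC has filled it */
    uint16_t length;     /* bytes written by the NIC */
};

struct rx_ring_model {
    struct rx_desc_model desc[RING_SIZE];
    unsigned int next_to_clean;   /* next descriptor the driver inspects */
};

/*
 * Consume completed descriptors, at most budget of them. Adds the bytes
 * received to *bytes and returns the number of packets taken off the ring.
 */
static int model_clean_rx_ring(struct rx_ring_model *ring, int budget,
                               unsigned int *bytes)
{
    int packets = 0;

    while (packets < budget) {
        struct rx_desc_model *d = &ring->desc[ring->next_to_clean];

        if (!(d->status & DESC_DONE))
            break;                /* hardware has not filled this one yet */
        *bytes += d->length;      /* the real driver builds an sk_buff here */
        d->status = 0;            /* hand the descriptor back for reuse */
        ring->next_to_clean = (ring->next_to_clean + 1) % RING_SIZE;
        packets++;
    }
    return packets;
}
```

The return value plays the role of the work count that the poll function compares against its budget.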

10.......................................................

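File: net/core/dev.c

In recent kernels netif_receive_skb does two things:
1. Google's RPS (Receive Packet Steering) mechanism may hand the packet to another CPU at this point.

2. It passes the packet upward: __netif_receive_skb_core calls deliver_skb, which invokes the func callback of a packet_type structure. This callback is how the lower receive path submits a packet to a protocol, so this is the point at which the data enters the TCP/IP stack.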
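
The packet_type mechanism is essentially a lookup from the frame's protocol number to a receive callback. A simplified userspace sketch follows. The kernel keeps these entries in hashed lists and handles byte order, both of which are skipped here, and every name ending in _model is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ETH_P_IP  0x0800   /* EtherType for IPv4 */
#define ETH_P_ARP 0x0806   /* EtherType for ARP */

struct skb_model {
    uint16_t protocol;     /* EtherType of the frame */
    int handled_by;        /* set by the callback, for demonstration only */
};

struct packet_type_model {
    uint16_t type;
    int (*func)(struct skb_model *skb);
};

static int model_ip_rcv(struct skb_model *skb)
{
    skb->handled_by = 1;
    return 0;
}

static int model_arp_rcv(struct skb_model *skb)
{
    skb->handled_by = 2;
    return 0;
}

/* Registered protocol handlers; a flat table instead of the kernel's hash. */
static const struct packet_type_model ptype_table[] = {
    { ETH_P_IP,  model_ip_rcv  },
    { ETH_P_ARP, model_arp_rcv },
};

/* Returns the callback's result, or -1 if no handler matches the protocol. */
static int model_deliver_skb(struct skb_model *skb)
{
    for (size_t i = 0; i < sizeof(ptype_table) / sizeof(ptype_table[0]); i++) {
        if (ptype_table[i].type == skb->protocol)
            return ptype_table[i].func(skb);
    }
    return -1;
}
```

A frame with protocol 0x0800 ends up in model_ip_rcv, which corresponds to the ip_rcv registration described in the next step.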

11.......................................................

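Files: net/ipv4/af_inet.c, net/ipv4/ip_input.c
af_inet.c is the IPv4 AF_INET protocol module. During initialization, inet_init registers ip_packet_type, a packet_type structure whose func receive callback is ip_rcv. It also registers two important net_protocol entries, tcp_protocol and udp_protocol, which ip_local_deliver uses later.

ip_rcv is where TCP/IP stack processing begins. It mainly inspects the IP packet and then invokes the netfilter PRE_ROUTING hook.

Once the netfilter hook has run, ip_rcv_finish is called if the verdict for the packet is ACCEPT.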

12.......................................................

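Files: net/ipv4/ip_input.c, net/ipv4/route.c
ip_rcv_finish calls ip_route_input_noref and finally dst_input. dst_input simply calls through an input function pointer, which is the packet's entry point to the next layer up. That pointer is set during the preceding ip_route_input_noref call. For a packet addressed to the local host, input is ip_local_deliver.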
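
The split between "route lookup chooses the function" and "dst_input calls it" can be sketched as follows. This is a userspace model with invented types. The routing decision is reduced to a single address comparison, and model_forward merely stands for whatever non-local input handler the real lookup would install.

```c
#include <stdint.h>

struct skb_model;

struct dst_model {
    int (*input)(struct skb_model *skb);   /* chosen by the route lookup */
};

struct skb_model {
    uint32_t daddr;          /* destination IPv4 address */
    struct dst_model dst;
    int verdict;             /* records which handler ran */
};

enum { VERDICT_LOCAL = 1, VERDICT_FORWARD = 2 };

static int model_local_deliver(struct skb_model *skb)
{
    skb->verdict = VERDICT_LOCAL;
    return 0;
}

static int model_forward(struct skb_model *skb)
{
    skb->verdict = VERDICT_FORWARD;
    return 0;
}

/* Route lookup: pick the input handler from the destination address. */
static void model_route_input(struct skb_model *skb, uint32_t local_addr)
{
    skb->dst.input = (skb->daddr == local_addr) ? model_local_deliver
                                                : model_forward;
}

/* Mirrors dst_input: nothing but a call through the stored pointer. */
static int model_dst_input(struct skb_model *skb)
{
    return skb->dst.input(skb);
}

/* Mirrors the shape of ip_rcv_finish: look up the route, then dispatch. */
static int model_rcv_finish(struct skb_model *skb, uint32_t local_addr)
{
    model_route_input(skb, local_addr);
    return model_dst_input(skb);
}
```

The dispatch step itself never looks at the address; all of the decision making happens in the lookup.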

13.......................................................

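Files: net/ipv4/ip_input.c, net/ipv4/af_inet.c
ip_local_deliver first invokes the netfilter NF_INET_LOCAL_IN hook. If the packet passes, ip_local_deliver_finish is called next; it looks up the net_protocol structure for the packet's IP protocol and calls that structure's handler callback.

The net_protocol entries were registered earlier by inet_init in af_inet.c.

The handler for TCP is tcp_v4_rcv
The handler for UDP is udp_rcv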
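
The net_protocol lookup amounts to a table indexed by the IP header's protocol number. Below is a userspace sketch with one slot per possible protocol value. The registration function, the handler signatures and the counters are assumptions for illustration; only the protocol numbers 6 (TCP) and 17 (UDP) are the standard assigned values.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_INET_PROTOS 256
#define IPPROTO_TCP_NUM 6
#define IPPROTO_UDP_NUM 17

struct net_protocol_model {
    int (*handler)(const uint8_t *payload, size_t len);
};

static int tcp_count, udp_count;   /* how often each handler was called */

static int model_tcp_v4_rcv(const uint8_t *payload, size_t len)
{
    (void)payload;
    (void)len;
    tcp_count++;
    return 0;
}

static int model_udp_rcv(const uint8_t *payload, size_t len)
{
    (void)payload;
    (void)len;
    udp_count++;
    return 0;
}

static const struct net_protocol_model tcp_proto = { model_tcp_v4_rcv };
static const struct net_protocol_model udp_proto = { model_udp_rcv };

/* One slot per IP protocol number, filled in at initialization time. */
static const struct net_protocol_model *protos[MAX_INET_PROTOS];

/* Register a handler; fails if the protocol number is already taken. */
static int model_add_protocol(const struct net_protocol_model *p, uint8_t num)
{
    if (protos[num] != NULL)
        return -1;
    protos[num] = p;
    return 0;
}

/* Dispatch on the IP protocol field; -1 if nothing is registered. */
static int model_local_deliver_finish(uint8_t protocol,
                                      const uint8_t *payload, size_t len)
{
    const struct net_protocol_model *p = protos[protocol];

    if (p == NULL)
        return -1;
    return p->handler(payload, len);
}
```

After registering both entries, a packet with protocol 17 reaches only the UDP handler, and a protocol with no entry is rejected.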

14.......................................................

File: net/ipv4/udp.c
udp_rcv places the UDP packet on the sk_receive_queue receive queue of the sock structure sk.