Paper: High Performance Visual Tracking with Siamese Region Proposal Network
Code: https://github.com/STVIR/pysot (a reproduction that includes both training and testing code)
A more readable version that I adapted from pysot: https://github.com/laisimiao/siamrpn.pytorch
SiamRPN is a classic of the Siamese tracking family and the earliest anchor-based tracker, so once you understand the operations in SiamRPN, the follow-up works DaSiamRPN, SiamMask, and SiamRPN++ become much easier to follow. Here I record a few key points of the SiamRPN code:
[Figure: SiamRPN architecture diagram]

1、Architecture

The figure above already makes the overall model pipeline quite clear:

1. The template frame and the detection frame pass through the same Siamese network to produce features, which are then fed into the classification branch and the regression branch of the RPN; the template feature acts as a kernel that is correlated over the detection feature.
2. The classification branch predicts which anchors on the original image have an IoU with the target above a certain threshold; the corresponding positions on the final feature map are labeled 1. The regression branch predicts the x, y, w, h offsets between each anchor and the target box.

However, the actual implementation builds the model slightly differently (as indicated by the red text after the strikethrough in the figure above). This does not change the shape of the predicted tensors, which once again shows the black-box nature of CNNs (as long as it trains and works, an explanation can be attached afterwards). The code is mainly in pysot/models/head/rpn.py, wired up through get_rpn_head:

class DepthwiseXCorr(nn.Module):
    def __init__(self, in_channels, hidden, out_channels, kernel_size=3, hidden_kernel_size=5):
        # in_channels: 256, hidden: 256, out_channels: 2*K (K is the number of anchors)
        super(DepthwiseXCorr, self).__init__()
        self.conv_kernel = nn.Sequential(
                nn.Conv2d(in_channels, hidden, kernel_size=kernel_size, bias=False),
                nn.BatchNorm2d(hidden),
                nn.ReLU(inplace=True),
                )
        self.conv_search = nn.Sequential(
                nn.Conv2d(in_channels, hidden, kernel_size=kernel_size, bias=False),
                nn.BatchNorm2d(hidden),
                nn.ReLU(inplace=True),
                )
        self.head = nn.Sequential(
                nn.Conv2d(hidden, hidden, kernel_size=1, bias=False),
                nn.BatchNorm2d(hidden),
                nn.ReLU(inplace=True),
                nn.Conv2d(hidden, out_channels, kernel_size=1)
                )
        

    def forward(self, kernel, search):
        kernel = self.conv_kernel(kernel)
        search = self.conv_search(search)
        feature = xcorr_depthwise(search, kernel)
        out = self.head(feature)
        return out
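The correlation itself is done by xcorr_depthwise (pysot/core/xcorr.py): the template feature is used as a per-channel kernel slid over the search feature, implemented with a grouped convolution. Below is a condensed sketch close to the pysot implementation, followed by a shape check with hypothetical SiamRPN-style sizes:

import torch
import torch.nn.functional as F

def xcorr_depthwise(x, kernel):
    """Depthwise cross-correlation: each channel of kernel (template feature)
    is correlated with the corresponding channel of x (search feature)."""
    batch, channel = kernel.size(0), kernel.size(1)
    x = x.view(1, batch * channel, x.size(2), x.size(3))
    kernel = kernel.view(batch * channel, 1, kernel.size(2), kernel.size(3))
    # groups=batch*channel treats every channel as its own group, i.e. depthwise
    out = F.conv2d(x, kernel, groups=batch * channel)
    return out.view(batch, channel, out.size(2), out.size(3))

# shape check with hypothetical sizes: a 6x6 template kernel over a 22x22 search feature
z_f = torch.randn(1, 256, 6, 6)
x_f = torch.randn(1, 256, 22, 22)
print(xcorr_depthwise(x_f, z_f).shape)  # torch.Size([1, 256, 17, 17])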

2、Generate Anchors

I had always known that anchors are scattered over the original image (i.e., the detection frame in the figure above), but only after reading the code did I realize this also involves mapping points on the feature map back to the original image; anchors of different scales and aspect ratios are then placed centered on these mapped-back points (so this step is indispensable whether the method is anchor-based or anchor-free). Let's see how the code implements it, mainly generate_all_anchors in pysot/utils/anchor.py:

    def generate_all_anchors(self, im_c, size):
        """
        im_c: image center -> cfg.TRAIN.SEARCH_SIZE//2
        size: image size   -> cfg.TRAIN.OUTPUT_SIZE
        """
        if self.image_center == im_c and self.size == size:
            return False
        self.image_center = im_c
        self.size = size

        a0x = im_c - size // 2 * self.stride
        ori = np.array([a0x] * 4, dtype=np.float32)
        # self.anchors here holds the K anchors for a single location
        # ori is the top-left-most position mapped back onto the detection frame
        zero_anchors = self.anchors + ori

        x1 = zero_anchors[:, 0]
        y1 = zero_anchors[:, 1]
        x2 = zero_anchors[:, 2]
        y2 = zero_anchors[:, 3]

        x1, y1, x2, y2 = map(lambda x: x.reshape(self.anchor_num, 1, 1),
                             [x1, y1, x2, y2])
        cx, cy, w, h = corner2center([x1, y1, x2, y2])

        disp_x = np.arange(0, size).reshape(1, 1, -1) * self.stride
        disp_y = np.arange(0, size).reshape(1, -1, 1) * self.stride

        cx = cx + disp_x
        cy = cy + disp_y

        # broadcast
        zero = np.zeros((self.anchor_num, size, size), dtype=np.float32)
        cx, cy, w, h = map(lambda x: x + zero, [cx, cy, w, h])
        x1, y1, x2, y2 = center2corner([cx, cy, w, h])

        self.all_anchors = (np.stack([x1, y1, x2, y2]).astype(np.float32),
                            np.stack([cx, cy, w,  h]).astype(np.float32))
        return True
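As a quick numeric illustration of the mapping (with hypothetical values stride = 8, SEARCH_SIZE = 255 so im_c = 127, and OUTPUT_SIZE = 25; the real numbers come from the config):

import numpy as np

stride, im_c, size = 8, 255 // 2, 25      # hypothetical config values for illustration
a0x = im_c - size // 2 * stride           # 127 - 12*8 = 31: top-left mapped-back position
centers = a0x + stride * np.arange(size)  # 31, 39, ..., 223: anchor centers along one axis
print(a0x, centers[0], centers[-1])       # 31 31 223

# Every feature-map cell (i, j) therefore maps to image position
# (a0x + stride*j, a0x + stride*i), and the K anchors with different
# scales/aspect ratios are placed at each of these size*size centers.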

The figure below is a diagram I drew (left: the detection frame; right: the same frame with all the anchors overlaid):
[Figure: detection frame (left) and all anchors overlaid on it (right)]

3、Labels

Below is the paper's description of the labels:
[Figure: label definition for the predictions]
Now let's look at how the code implements this, mainly the __call__ method in pysot/datasets/anchor_target.py:

    def __call__(self, target, size, neg=False):
        anchor_num = len(cfg.ANCHOR.RATIOS) * len(cfg.ANCHOR.SCALES)

        # -1 ignore 0 negative 1 positive
        cls = -1 * np.ones((anchor_num, size, size), dtype=np.int64)
        delta = np.zeros((4, anchor_num, size, size), dtype=np.float32)
        delta_weight = np.zeros((anchor_num, size, size), dtype=np.float32)

        def select(position, keep_num=16):
            num = position[0].shape[0]
            if num <= keep_num:
                return position, num
            slt = np.arange(num)
            np.random.shuffle(slt)
            slt = slt[:keep_num]
            return tuple(p[slt] for p in position), keep_num

        tcx, tcy, tw, th = corner2center(target)

        if neg:
            # l = size // 2 - 3
            # r = size // 2 + 3 + 1
            # cls[:, l:r, l:r] = 0

            cx = size // 2
            cy = size // 2
            cx += int(np.ceil((tcx - cfg.TRAIN.SEARCH_SIZE // 2) /
                      cfg.ANCHOR.STRIDE + 0.5))
            cy += int(np.ceil((tcy - cfg.TRAIN.SEARCH_SIZE // 2) /
                      cfg.ANCHOR.STRIDE + 0.5))
            l = max(0, cx - 3)
            r = min(size, cx + 4)
            u = max(0, cy - 3)
            d = min(size, cy + 4)
            cls[:, u:d, l:r] = 0

            neg, neg_num = select(np.where(cls == 0), cfg.TRAIN.NEG_NUM)
            cls[:] = -1
            cls[neg] = 0

            overlap = np.zeros((anchor_num, size, size), dtype=np.float32)
            return cls, delta, delta_weight, overlap

        anchor_box = self.anchors.all_anchors[0]
        anchor_center = self.anchors.all_anchors[1]
        x1, y1, x2, y2 = anchor_box[0], anchor_box[1], \
            anchor_box[2], anchor_box[3]
        cx, cy, w, h = anchor_center[0], anchor_center[1], \
            anchor_center[2], anchor_center[3]

        delta[0] = (tcx - cx) / w
        delta[1] = (tcy - cy) / h
        delta[2] = np.log(tw / w)
        delta[3] = np.log(th / h)

        overlap = IoU([x1, y1, x2, y2], target)

        pos = np.where(overlap > cfg.TRAIN.THR_HIGH)
        neg = np.where(overlap < cfg.TRAIN.THR_LOW)

        pos, pos_num = select(pos, cfg.TRAIN.POS_NUM)
        neg, neg_num = select(neg, cfg.TRAIN.TOTAL_NUM - cfg.TRAIN.POS_NUM)

        cls[pos] = 1
        delta_weight[pos] = 1. / (pos_num + 1e-6)

        cls[neg] = 0
        return cls, delta, delta_weight, overlap
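To make the delta encoding concrete, here is a tiny round-trip check with toy numbers; the decode step is exactly what the tracker has to undo at test time:

import numpy as np

# one anchor (cx, cy, w, h) and one target box, toy values
cx, cy, w, h = 100.0, 100.0, 64.0, 64.0
tcx, tcy, tw, th = 110.0, 90.0, 80.0, 40.0

# encode: this is what delta[0..3] stores above
d = np.array([(tcx - cx) / w, (tcy - cy) / h, np.log(tw / w), np.log(th / h)])

# decode: invert the encoding to recover the target box from the anchor
rcx, rcy = d[0] * w + cx, d[1] * h + cy
rw, rh = np.exp(d[2]) * w, np.exp(d[3]) * h
print(d)                 # [ 0.15625 -0.15625  0.22314355 -0.47000363]
print(rcx, rcy, rw, rh)  # 110.0 90.0 80.0 40.0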

Below is a visualization of the cls labels, with the K (K = 5) channels drawn separately; the delta regression targets are too high-dimensional to visualize easily (yellow = 1, purple = 0, cyan = -1):
[Figure: per-channel visualization of the cls labels]

Below is a visualization of the four label channels for one particular anchor (one specific K); since the w offset and h offset are negative, those channels appear entirely purple:
[Figure: four-channel delta labels for a single anchor]

4、Losses

The figure below is the part of the paper that describes the loss functions. We can see that:

  • The classification branch is supervised with a cross-entropy loss, so that the feature-map positions corresponding to anchors whose IoU with the target exceeds a certain threshold are pushed towards 1; this makes it easier to pick those anchors in the tracking phase and regress the target location more reliably.
  • The regression formula in the paper is written somewhat loosely: $\delta$ is actually the learning target, i.e. the deviation between the anchor and the target, namely the normalized cx, cy, w, h offsets. We predict these offsets, so a smooth L1 loss is used (the targets and the loss are reproduced below).

[Figure: loss-function section of the paper]
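For reference, here is a cleaned-up version of the regression targets and the smooth L1 loss as defined in the paper (the standard Faster R-CNN style parameterization, with $A$ the anchor and $T$ the target box):

$$
\delta[0]=\frac{T_x-A_x}{A_w},\quad
\delta[1]=\frac{T_y-A_y}{A_h},\quad
\delta[2]=\ln\frac{T_w}{A_w},\quad
\delta[3]=\ln\frac{T_h}{A_h}
$$

$$
\operatorname{smooth}_{L1}(x,\sigma)=
\begin{cases}
0.5\,\sigma^{2}x^{2}, & |x|<\dfrac{1}{\sigma^{2}}\\[4pt]
|x|-\dfrac{1}{2\sigma^{2}}, & \text{otherwise}
\end{cases}
$$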
Now let's see how the code implements this; the entry point is the forward method in pysot/models/model_builder.py:

cls_loss = select_cross_entropy_loss(cls, label_cls)
loc_loss = weight_l1_loss(loc, label_loc, label_loc_weight)

4.1、Cross Entropy Loss

One look at the code shows this is the classic binary cross-entropy loss; just three points are worth noting:

  1. The pred here has already been passed through F.log_softmax, so only F.nll_loss is needed (see the sketch after the code below)
  2. Since the label still contains ignored positions (-1), only the positive positions (1) and negative positions (0) are selected and counted towards the loss
  3. As seen in section 3 (Labels) above, negative pairs only produce labels 0 and -1, so the classification loss only supervises the negative samples and there is no regression loss in that case

def get_cls_loss(pred, label, select):
    if len(select.size()) == 0 or \
            select.size() == torch.Size([0]):
        return 0
    pred = torch.index_select(pred, 0, select)
    label = torch.index_select(label, 0, select)
    return F.nll_loss(pred, label)

def select_cross_entropy_loss(pred, label):
    """
    :param pred: (N,K,17,17,2)
    :param label: (N,K,17,17)
    :return:
    """
    pred = pred.view(-1, 2)
    label = label.view(-1)
    pos = label.data.eq(1).nonzero().squeeze().cuda()  # (#pos,)
    neg = label.data.eq(0).nonzero().squeeze().cuda()  # (#neg,)
    loss_pos = get_cls_loss(pred, label, pos)
    loss_neg = get_cls_loss(pred, label, neg)
    return loss_pos * 0.5 + loss_neg * 0.5
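Point 1 above refers to how the raw (N, 2K, 17, 17) classification output is reshaped and passed through F.log_softmax before it ever reaches select_cross_entropy_loss; this is roughly what ModelBuilder.log_softmax in pysot does (paraphrased as a sketch):

import torch.nn.functional as F

def log_softmax(cls):
    # cls: (N, 2K, 17, 17) raw classification output
    b, a2, h, w = cls.size()
    cls = cls.view(b, 2, a2 // 2, h, w)
    cls = cls.permute(0, 2, 3, 4, 1).contiguous()  # -> (N, K, 17, 17, 2)
    cls = F.log_softmax(cls, dim=4)                # log-probabilities over {neg, pos}
    return cls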

4.2、L1 Loss

The code does not actually use a smooth L1 loss, but a plain L1 loss, i.e. $L_1 = \frac{1}{n}\sum_{i=1}^{n}\left|f(x_i)-y_i\right|$. Two points to note here as well:

  1. The loss_weight means negative positions are not counted in the regression loss at all, and the loss at positive positions is normalized by the number of positive anchors (see the toy check after the code below)
  2. Don't forget that the L1 loss is divided by the batch size at the end

def weight_l1_loss(pred_loc, label_loc, loss_weight):
    """
    :param pred_loc: (N,4K,17,17)
    :param label_loc: (N,4,K,17,17)
    :param loss_weight: (N,K,17,17)
    :return:
    """
    b, _, sh, sw = pred_loc.size()
    pred_loc = pred_loc.view(b, 4, -1, sh, sw)
    diff = (pred_loc - label_loc).abs()
    diff = diff.sum(dim=1).view(b, -1, sh, sw)
    loss = diff * loss_weight
    return loss.sum().div(b)
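A toy sanity check of the weighting (hypothetical shapes, K = 5, feature size 17): with loss_weight equal to 1/pos_num at positive anchors and 0 elsewhere, as built in the Labels section, the weighted sum reduces to the mean L1 error over the positive anchors, averaged over the batch:

import torch

N, K, S = 2, 5, 17
pred_loc = torch.randn(N, 4 * K, S, S)
label_loc = torch.randn(N, 4, K, S, S)

pos = torch.rand(N, K, S, S) > 0.95                      # pretend these are the positive anchors
pos_num = pos.view(N, -1).sum(dim=1).clamp(min=1).float()
loss_weight = pos.float() / pos_num.view(N, 1, 1, 1)     # 1/pos_num at positives, 0 elsewhere

loss = weight_l1_loss(pred_loc, label_loc, loss_weight)  # uses the function defined above
print(loss.item())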

5、Track Phase

This part lives in pysot/tracker/siamrpn_tracker.py and mainly implements two methods, init and track:

5.1、 init

This part uses the prior information from the first frame, i.e. the first image and its ground-truth bbox, which makes it analogous to one-shot detection: the template frame is then fixed and acts as a kernel.

[Figure: track phase]

    def init(self, img, bbox):
        """
        args:
            img(np.ndarray): BGR image
            bbox: (x, y, w, h) bbox
        """
        # self.center_pos and self.size will be updated as tracking proceeds
        self.center_pos = np.array([bbox[0]+(bbox[2]-1)/2,
                                    bbox[1]+(bbox[3]-1)/2])
        self.size = np.array([bbox[2], bbox[3]])

        # calculate z crop size
        w_z = self.size[0] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
        h_z = self.size[1] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
        s_z = round(np.sqrt(w_z * h_z))

        # calculate channel average
        self.channel_average = np.mean(img, axis=(0, 1))

        # get crop
        z_crop = self.get_subwindow(img, self.center_pos,
                                    cfg.TRACK.EXEMPLAR_SIZE,
                                    s_z, self.channel_average)
        self.model.template(z_crop)
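As a quick check of the exemplar crop size with toy numbers (assuming the usual pysot defaults cfg.TRACK.CONTEXT_AMOUNT = 0.5 and cfg.TRACK.EXEMPLAR_SIZE = 127; the bbox is made up for illustration):

import numpy as np

context_amount, exemplar_size = 0.5, 127    # assumed default config values
w, h = 100.0, 50.0                          # toy target size

w_z = w + context_amount * (w + h)          # 100 + 0.5*150 = 175
h_z = h + context_amount * (w + h)          # 50  + 0.5*150 = 125
s_z = round(np.sqrt(w_z * h_z))             # 148: side of the square crop in the original image
print(s_z)

# get_subwindow then crops this s_z x s_z square around the target center
# and resizes it to exemplar_size x exemplar_size before model.template().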

5.2、 track

Here a subsequent frame is fed in; the predictions are penalized for scale and aspect-ratio changes, a cosine window is applied to suppress large displacements, and the anchor with the highest (penalized) classification score is used to regress the target position:

    def track(self, img):
        """
        args:
            img(np.ndarray): BGR image
        return:
            bbox(list):[x, y, width, height]
        """
        w_z = self.size[0] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
        h_z = self.size[1] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
        s_z = np.sqrt(w_z * h_z)
        scale_z = cfg.TRACK.EXEMPLAR_SIZE / s_z
        s_x = s_z * (cfg.TRACK.INSTANCE_SIZE / cfg.TRACK.EXEMPLAR_SIZE)
        x_crop = self.get_subwindow(img, self.center_pos,
                                    cfg.TRACK.INSTANCE_SIZE,
                                    round(s_x), self.channel_average)

        outputs = self.model.track(x_crop)

        score = self._convert_score(outputs['cls'])
        pred_bbox = self._convert_bbox(outputs['loc'], self.anchors)

        def change(r):
            return np.maximum(r, 1. / r)

        def sz(w, h):
            pad = (w + h) * 0.5
            return np.sqrt((w + pad) * (h + pad))

        # scale penalty
        s_c = change(sz(pred_bbox[2, :], pred_bbox[3, :]) /
                     (sz(self.size[0]*scale_z, self.size[1]*scale_z)))

        # aspect ratio penalty
        r_c = change((self.size[0]/self.size[1]) /
                     (pred_bbox[2, :]/pred_bbox[3, :]))
        penalty = np.exp(-(r_c * s_c - 1) * cfg.TRACK.PENALTY_K)
        pscore = penalty * score

        # window penalty
        pscore = pscore * (1 - cfg.TRACK.WINDOW_INFLUENCE) + \
            self.window * cfg.TRACK.WINDOW_INFLUENCE
        best_idx = np.argmax(pscore)

        bbox = pred_bbox[:, best_idx] / scale_z
        lr = penalty[best_idx] * score[best_idx] * cfg.TRACK.LR

        cx = bbox[0] + self.center_pos[0]
        cy = bbox[1] + self.center_pos[1]

        # smooth bbox
        width = self.size[0] * (1 - lr) + bbox[2] * lr
        height = self.size[1] * (1 - lr) + bbox[3] * lr

        # clip boundary
        cx, cy, width, height = self._bbox_clip(cx, cy, width,
                                                height, img.shape[:2])

        # update state
        self.center_pos = np.array([cx, cy])
        self.size = np.array([width, height])

        bbox = [cx - width / 2,
                cy - height / 2,
                width,
                height]
        best_score = score[best_idx]
        return {
                'bbox': bbox,
                'best_score': best_score
               }

Since the anchors were generated with the search-region center as the origin (rather than the top-left corner), the decoded cx and cy above (the lines `cx = bbox[0] + self.center_pos[0]` and `cy = bbox[1] + self.center_pos[1]`) can be added directly to the previous frame's center position.
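For completeness, this is roughly what the tracker's _convert_bbox does (paraphrased from pysot as a sketch): it inverts the delta encoding from the Labels section and turns the regression output into (cx, cy, w, h) relative to the anchors, which themselves have the search-region center as the origin:

import numpy as np

def convert_bbox(delta, anchor):
    """Sketch of SiamRPNTracker._convert_bbox.

    delta:  (1, 4K, H, W) regression output of the network
    anchor: (K*H*W, 4) anchors as (cx, cy, w, h), origin at the search-region center
    """
    delta = delta.permute(1, 2, 3, 0).contiguous().view(4, -1)
    delta = delta.detach().cpu().numpy()

    # invert the training encoding: delta[0] = (tcx - cx) / w  ->  cx_pred = delta[0]*w + cx, etc.
    delta[0, :] = delta[0, :] * anchor[:, 2] + anchor[:, 0]
    delta[1, :] = delta[1, :] * anchor[:, 3] + anchor[:, 1]
    delta[2, :] = np.exp(delta[2, :]) * anchor[:, 2]
    delta[3, :] = np.exp(delta[3, :]) * anchor[:, 3]
    return delta  # (4, K*H*W): predicted (cx, cy, w, h) per anchor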

PS

Video walkthrough: https://www.bilibili.com/video/BV1tz4y1f7Cj
Slides used in the video (download link; if you have CSDN points, your support is appreciated): https://download.csdn.net/download/laizi_laizi/12776130

Writing up to this point, I feel the Siamese trackers are all becoming quite similar, and Siamese-based works generally build their code on top of the open-source pysot. It seems I have reached a certain depth in the tracking field and hit something of a plateau; the next stage is to think up and implement ideas of my own. Keep going! 💪
