Deep High-Resolution Representation Learning for Human Pose Estimation

[paper] https://arxiv.org/pdf/1902.09212.pdf

[github] https://github.com/leoxiaobin/deep-high-resolution-net.pytorch

Table of Contents

Deep High-Resolution Representation Learning for Human Pose Estimation

Abstract

Introduction

Related Work

High-to-low and low-to-high

Multi-scale fusion

Intermediate supervision

Our approach

Approach

Sequential multi-resolution subnetworks

Parallel multi-resolution subnetworks

Repeated multi-scale fusion

Heatmap estimation

Network instantiation

Experiments

Ablation Study


Abstract

In this paper, we are interested in the human pose estimation problem with a focus on learning reliable high-resolution representations. Most existing methods recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Instead, our proposed network maintains high-resolution representations through the whole process.

To obtain high-resolution representations, most existing methods work low-to-high, recovering them from low-resolution features.

This paper instead maintains the high-resolution representation through the whole process.

We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions such that each of the high-to-low resolution representations receives information from other parallel representations over and over, leading to rich high-resolution representations.

Overall structure of the proposed network:

1. First stage: a high-resolution subnetwork.

2. Later stages: high-to-low resolution subnetworks, added one by one.

3. Connections: the multi-resolution subnetworks run in parallel with repeated multi-scale fusion, i.e., the low-resolution representations of every stage are repeatedly fused with the high-resolution representation kept from the first stage (and vice versa).

 

Introduction

Most existing methods pass the input through a network, typically consisting of high-to-low resolution subnetworks that are connected in series, and then raise the resolution. For instance, Hourglass [40] recovers the high resolution through a symmetric low-to-high process. SimpleBaseline [72] adopts a few transposed convolution layers for generating high-resolution representations. In addition, dilated convolutions are also used to blow up the later layers of a high-to-low resolution network (e.g., VGGNet or ResNet) [27, 77].

Existing methods generally follow the high-to-low structure, e.g., Hourglass [Stacked Hourglass Networks for Human Pose Estimation, ECCV 2016] and SimpleBaseline [Simple Baselines for Human Pose Estimation and Tracking, ECCV 2018].

Dilated convolutions fit the same pattern: in a dilated residual network, the last few layers of ResNet use dilated convolutions to keep the resolution unchanged, so that only a light low-to-high recovery is needed afterwards.

We present a novel architecture, namely High-Resolution Net (HRNet), which is able to maintain high-resolution representations through the whole process. We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel. We conduct repeated multi-scale fusions by exchanging the information across the parallel multi-resolution subnetworks over and over through the whole process. We estimate the keypoints over the high-resolution representations output by our network. The resulting network is illustrated in Figure 1.

This paragraph is almost identical to the abstract. =.=|

Figure 1. Illustrating the architecture of the proposed HRNet. It consists of parallel high-to-low resolution subnetworks with repeated information exchange across multi-resolution subnetworks (multi-scale fusion). The horizontal and vertical directions correspond to the depth of the network and the scale of the feature maps, respectively.

Our network has two benefits in comparison to existing widely-used networks [40, 27, 77, 72] for pose estimation. (i) Our approach connects high-to-low resolution subnetworks in parallel rather than in series as done in most existing solutions. Thus, our approach is able to maintain the high resolution instead of recovering the resolution through a low-to-high process, and accordingly the predicted heatmap is potentially spatially more precise. (ii) Most existing fusion schemes aggregate low-level and high-level representations. Instead, we perform repeated multi-scale fusions to boost the high-resolution representations with the help of the low-resolution representations of the same depth and similar level, and vice versa, so that the high-resolution representations are also rich for pose estimation. Consequently, our predicted heatmap is potentially more accurate.

Two advantages: (i) connecting the subnetworks in parallel rather than in series keeps the high-resolution representation through the whole process, so the predicted heatmaps are spatially more precise; (ii) repeated fusion across scales (stages) lets the high- and low-resolution representations boost each other, yielding richer high-resolution representations.

 

Related Work

There are two mainstream methods: regressing the position of keypoints [66, 7], and estimating keypoint heatmaps [13, 14, 78] followed by choosing the locations with the highest heat values as the keypoints.

Pose estimation has two mainstream approaches: directly regressing keypoint positions, and estimating keypoint heatmaps.

Most convolutional neural networks for keypoint heatmap estimation consist of a stem subnetwork similar to the classification network, which decreases the resolution, a main body producing the representations with the same resolution as its input, followed by a regressor estimating the heatmaps where the keypoint positions are estimated and then transformed to the full resolution. The main body mainly adopts the high-to-low and low-to-high framework, possibly augmented with multi-scale fusion and intermediate (deep) supervision.

Typical CNN-based keypoint heatmap estimation networks resemble classification networks: a high-to-low process followed by a low-to-high process.

 

High-to-low and low-to-high

The high-to-low process aims to generate low-resolution and high-level representations, and the low-to-high process aims to produce high-resolution representations [4, 11, 23, 72, 40, 62]. Both the two processes are possibly repeated several times for boosting the performance [77, 40, 14].

high-to-low: produces low-resolution, high-level representations.

low-to-high: produces high-resolution representations.

Both processes may be repeated several times to boost performance.

Representative network design patterns include: (i) Symmetric high-to-low and low-to-high processes. Hourglass and its follow-ups [40, 14, 77, 31] design the low-to-high process as a mirror of the high-to-low process. (ii) Heavy high-to-low and light low-to-high. The high-to-low process is based on the ImageNet classification network, e.g., ResNet adopted in [11, 72], and the low-to-high process is simply a few bilinear-upsampling [11] or transpose convolution [72] layers. (iii) Combination with dilated convolutions. In [27, 51, 35], dilated convolutions are adopted in the last two stages in the ResNet or VGGNet to eliminate the spatial resolution loss, which is followed by a light low-to-high process to further increase the resolution, avoiding expensive computation cost for only using dilated convolutions [11, 27, 51]. Figure 2 depicts four representative pose estimation networks.

Three representative design patterns:

symmetric high-to-low and low-to-high;

heavy high-to-low, light low-to-high;

combination with dilated convolutions (also heavy high-to-low, light low-to-high).

Figure 2. Illustration of representative pose estimation networks that rely on the high-to-low and low-to-high framework. (a) Hourglass [40]. (b) Cascaded pyramid networks [11]. (c) Simple Baseline [72]: transposed convolutions for low-to-high processing. (d) Combination with dilated convolutions [27].

Bottom-right legend: reg. = regular convolution, dilated = dilated convolution, trans. = transposed convolution, strided = strided convolution, concat. = concatenation.

In (a), the high-to-low and low-to-high processes are symmetric. In (b), (c) and (d), the high-to-low process, a part of a classification network (ResNet or VGGNet), is heavy, and the low-to-high process is light.

In (a) and (b), the skip-connections (dashed lines) between the same-resolution layers of the high-to-low and low-to-high processes mainly aim to fuse low-level and high-level features. In (b), the right part, refinenet, combines the low-level and high-level features that are processed through convolutions.

Four representative pose estimation networks are listed:

(a) Hourglass [Stacked Hourglass Networks for Human Pose Estimation].

(b) Cascaded pyramid networks [Cascaded Pyramid Network for Multi-Person Pose Estimation].

(c) SimpleBaseline [Simple Baselines for Human Pose Estimation and Tracking].

(d) Combination with dilated convolutions [DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model].

(a) is a symmetric network; (b), (c), (d) are asymmetric: high-to-low is heavy, low-to-high is light.

 

Multi-scale fusion

The straightforward way is to feed multi-resolution images separately into multiple networks and aggregate the output response maps [64]. Hourglass [40] and its extensions [77, 31] combine low-level features in the high-to-low process into the same-resolution high-level features in the low-to-high process progressively through skip connections. In cascaded pyramid network [11], a globalnet combines low-to-high level features in the high-to-low process progressively into the low-tohigh process, and then a refinenet combines the low-to-high level features that are processed through convolutions. Our approach repeats multi-scale fusion, which is partially inspired by deep fusion and its extensions [Deeply-fused nets, Interleaved structured sparse convolutional neural networks, IGCV v1, IGCV v3].

Multi-scale fusion appears in networks as skip connections, a globalnet/refinenet pair, or deep fusion; this paper applies deep-fusion-style multi-scale fusion repeatedly.

Intermediate supervision

Intermediate supervision or deep supervision, early developed for image classification [34, 61], is also adopted for helping deep networks training and improving the heatmap estimation quality, e.g., [69, 40, 64, 3, 11]. The hourglass approach [40] and the convolutional pose machine approach [69] process the intermediate heatmaps as the input or a part of the input of the remaining subnetwork.

 

Our approach

Our network connects high-to-low subnetworks in parallel. It maintains high-resolution representations through the whole process for spatially precise heatmap estimation. It generates reliable high-resolution representations through repeatedly fusing the representations produced by the high-to-low subnetworks. Our approach is different from most existing works, which need a separate low-to-high upsampling process and aggregate low-level and high-level representations. Our approach, without using intermediate heatmap supervision, is superior in keypoint detection accuracy and efficient in computation complexity and parameters.

Key points:

1. High-to-low subnetworks are connected in parallel.

2. High-resolution feature maps are kept through the whole process.

3. High-to-low features are fused repeatedly.

4. No intermediate heatmap supervision is used.

 

There are related multi-scale networks for classification and segmentation [5, 8, 74, 81, 30, 76, 55, 56, 24, 83, 55, 52, 18]. Our work is partially inspired by some of them [56, 24, 83, 55], and there are clear differences making them not applicable to our problem. Convolutional neural fabrics [56] and interlinked CNN [83] fail to produce high-quality segmentation results because of a lack of proper design on each subnetwork (depth, batch normalization) and multi-scale fusion. The grid network [18], a combination of many weight-shared U-Nets, consists of two separate fusion processes across multi-resolution representations: on the first stage, information is only sent from high resolution to low resolution; on the second stage, information is only sent from low resolution to high resolution, and thus less competitive. Multi-scale densenets [24] does not target and cannot generate reliable high-resolution representations.

The paper compares against four related multi-scale networks; ours is loosely related to each of the following.

Convolutional neural fabrics [Convolutional Neural Fabrics]

interlinked CNN [Interlinked Convolutional Neural Networks for Face Parsing]

grid network [Residual Conv-Deconv Grid Network for Semantic Segmentation]

Multi-scale densenets [Multi-Scale Dense Convolutional Networks for Efficient Prediction]

 

Approach

Human pose estimation, a.k.a. (also known as) keypoint detection, aims to detect the locations of \small K keypoints or parts (e.g., elbow, wrist, etc.) from an image \small I of size W × H × 3. The state-of-the-art methods transform this problem to estimating \small K heatmaps of size \small {W}'\times {H}', \small \{H_1, H_2, ..., H_K\}, where each heatmap \small H_k indicates the location confidence of the \small kth keypoint.

We follow the widely-adopted pipeline [40, 72, 11] to predict human keypoints using a convolutional network, which is composed of a stem consisting of two strided convolutions decreasing the resolution, a main body outputting the feature maps with the same resolution as its input feature maps, and a regressor estimating the heatmaps where the keypoint positions are chosen and transformed to the full resolution. We focus on the design of the main body and introduce our High-Resolution Net (HRNet) that is depicted in Figure 1.

This paragraph describes the pipeline: resolution reduction (stem) - feature transformation (output resolution equals input resolution) - heatmap regression.

 

Sequential multi-resolution subnetworks

Let \small \mathcal{N}_{sr} be the subnetwork in the \small sth stage and \small r be the resolution index (Its resolution is \small 1/2^{r-1} of the resolution of the first subnetwork). The high-to-low network with \small S (e.g., 4) stages can be denoted as:

 

Parallel multi-resolution subnetworks

We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one, forming new stages, and connect the multi-resolution subnetworks in parallel. As a result, the resolutions for the parallel subnetworks of a later stage consist of the resolutions from the previous stage, and an extra lower one. An example network structure, containing 4 parallel subnetworks, is given as follows,

These two subsections give two multi-resolution network structures: the former is the usual serial encoder structure; the latter is the parallel structure.
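As a sanity check on the parallel structure, a small sketch (my own helper, not from the paper's code) enumerates the downscale factors of the branches in each stage: subnetwork \small \mathcal{N}_{sr} runs at \small 1/2^{r-1} of the first subnetwork's resolution, and stage \small s contains branches \small r=1,...,s.

```python
# Hypothetical helper (not in the paper's code): list, for each stage s,
# the downscale factors of its parallel subnetworks. Branch r runs at
# 1/2^(r-1) of the resolution of the first (highest-resolution) branch.

def stage_scales(num_stages=4):
    return [[2 ** (r - 1) for r in range(1, s + 1)]
            for s in range(1, num_stages + 1)]

for s, scales in enumerate(stage_scales(), start=1):
    print(f"stage {s}: downscale factors {scales}")
# stage 4 ends with factors [1, 2, 4, 8], i.e. resolutions 1, 1/2, 1/4, 1/8
```

Each later stage keeps all the resolutions of the previous stage and appends one extra lower resolution, matching the description above.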

 

Repeated multi-scale fusion

We introduce exchange units across parallel subnetworks such that each subnetwork repeatedly receives the information from other parallel subnetworks. Here is an example showing the scheme of exchanging information. We divide the third stage into several (e.g., 3) exchange blocks, and each block is composed of 3 parallel convolution units with an exchange unit across the parallel units, which is given as follows,

where \small C^b_{sr} represents the convolution unit in the \small rth resolution of the \small bth block in the \small sth stage, and \small \varepsilon ^b_s is the corresponding exchange unit.

The paper does not simply use a parallel structure; it also designs the fusion process across feature maps of different scales.

The fusion process follows the structure of Eq. (3).

We illustrate the exchange unit in Figure 3 and present the formulation in the following. We drop the subscript \small s and the superscript \small b for discussion convenience. The inputs are \small s response maps: \small \{X_1, X_2, ..., X_s\}. The outputs are \small s response maps: \small \{Y_1, Y_2, ..., Y_s\}, whose resolutions and widths are the same as the input. Each output is an aggregation of the input maps, \small Y_k=\sum^s_{i=1}a(X_i,k). The exchange unit across stages has an extra output map \small Y_{s+1}=a(Y_s, s+1).

Figure 3. Illustrating how the exchange unit aggregates the information for high, medium and low resolutions from the left to the right, respectively. Right legend: strided 3×3 = strided 3×3 convolution, up samp. 1×1 = nearest neighbor up-sampling following a 1 × 1 convolution.

Figure 3 shows the concrete network realization of Eq. (3). Note: an exchange unit that sits across two stages produces one extra output map at a new, lower resolution (\small Y_{s+1}=a(Y_s, s+1)).

The function \small a(X_i,k) consists of upsampling or downsampling \small X_i from resolution \small i to resolution \small k. We adopt strided 3 × 3 convolutions for downsampling. For instance, one strided 3 × 3 convolution with the stride 2 for 2× downsampling, and two consecutive strided 3 × 3 convolutions with the stride 2 for 4× downsampling. For upsampling, we adopt the simple nearest neighbor sampling following a 1 × 1 convolution for aligning the number of channels. If \small i=k, \small a(\cdot,\cdot) is just an identity connection: \small a(X_i,k)=X_i.

This paragraph explains how downsampling and upsampling are implemented: downsampling uses stride-2 3×3 convolutions; upsampling uses nearest-neighbor upsampling followed by a 1×1 convolution.
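The resampling function \small a(X_i,k) and the exchange-unit aggregation can be sketched with NumPy. This is an illustrative toy, not HRNet's actual layers: the real downsampling path uses learned strided 3×3 convolutions, whereas here downsampling is plain stride slicing and channel alignment is a random 1×1 projection; the function names and the channel dictionary are my own. The point is only to make the shape bookkeeping of \small Y_k=\sum_i a(X_i,k) concrete.

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbor spatial upsampling of a (C, H, W) map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def project_channels(x, out_channels, w):
    """1x1 convolution = per-pixel channel projection with weight w of shape (out, in)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(out_channels, h, wd)

def a(x, i, k, channels, rng):
    """Toy stand-in for a(X_i, k): resample the branch-i map to branch k's
    resolution and width. HRNet uses strided 3x3 convs for downsampling;
    here it is plain stride slicing, for shape illustration only."""
    if i == k:
        return x  # identity connection
    if i < k:     # downsample by 2^(k-i)
        f = 2 ** (k - i)
        x = x[:, ::f, ::f]
    else:         # upsample by 2^(i-k), nearest neighbor
        x = nearest_upsample(x, 2 ** (i - k))
    w = rng.standard_normal((channels[k], x.shape[0]))  # random 1x1 projection
    return project_channels(x, channels[k], w)

# Exchange unit: each output Y_k is the sum of all resampled inputs.
rng = np.random.default_rng(0)
channels = {1: 32, 2: 64, 3: 128}   # width doubles as resolution halves
X = {r: rng.standard_normal((channels[r], 32 // 2 ** (r - 1), 32 // 2 ** (r - 1)))
     for r in (1, 2, 3)}
Y = {k: sum(a(X[i], i, k, channels, rng) for i in (1, 2, 3)) for k in (1, 2, 3)}
for k in (1, 2, 3):
    print(k, Y[k].shape)  # each Y_k keeps branch k's resolution and width
```

Every output keeps the resolution and width of its own branch while aggregating information from all three resolutions, which is exactly what Figure 3 depicts.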

 

Heatmap estimation

We regress the heatmaps simply from the high-resolution representations output by the last exchange unit, which empirically works well. The loss function, defined as the mean squared error, is applied for comparing the predicted heatmaps and the groundtruth heatmaps. The groundtruth heatmaps are generated by applying a 2D Gaussian with standard deviation of 1 pixel centered on the groundtruth location of each keypoint.
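A minimal NumPy sketch of this target construction (the function names are mine, not the paper's; the heatmap grid is indexed as (row, col)):

```python
import numpy as np

def gaussian_heatmap(h, w, cy, cx, sigma=1.0):
    """Ground-truth heatmap: a 2D Gaussian (std = 1 pixel by default)
    centered on the ground-truth keypoint location (cy, cx)."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

def mse_loss(pred, gt):
    """Mean squared error between predicted and ground-truth heatmaps."""
    return float(np.mean((pred - gt) ** 2))

gt = gaussian_heatmap(64, 48, cy=20, cx=10)  # one keypoint at row 20, col 10
print(gt[20, 10])        # peak value 1.0 exactly at the keypoint
print(mse_loss(gt, gt))  # 0.0 for a perfect prediction
```

At inference time, the predicted keypoint is read off as the location of the highest heat value, then transformed back to the full input resolution.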

 

Network instantiation

We instantiate the network for keypoint heatmap estimation by following the design rule of ResNet to distribute the depth to each stage and the number of channels to each resolution.

The main body, i.e., our HRNet, contains four stages with four parallel subnetworks, whose resolutions are gradually decreased to a half and accordingly the widths (the numbers of channels) are doubled. The first stage contains 4 residual units where each unit, the same as in ResNet-50, is formed by a bottleneck with the width 64, and is followed by one 3×3 convolution reducing the width of the feature maps to \small C. The 2nd, 3rd, 4th stages contain 1, 4, 3 exchange blocks, respectively. One exchange block contains 4 residual units where each unit contains two 3 × 3 convolutions in each resolution and an exchange unit across resolutions. In summary, there are totally 8 exchange units, i.e., 8 multi-scale fusions are conducted.

In our experiments, we study one small net and one big net: HRNet-W32 and HRNet-W48, where 32 and 48 represent the widths (C) of the high-resolution subnetworks in last three stages, respectively. The widths of other three parallel subnetworks are 64, 128, 256 for HRNet-W32, and 96, 192, 384 for HRNet-W48.

Details of the HRNet structure:

1. 4 stages, 4 parallel subnetworks.

2. Stage 1: 4 residual units, each the same as the ResNet-50 unit (a bottleneck of width 64), followed by a 3×3 convolution that reduces the width to C.

3. Stages 2, 3, 4 contain 1, 4, 3 exchange blocks, respectively.

4. Each exchange block contains 4 residual units per resolution (each with two 3×3 convolutions, keeping the resolution unchanged) and 1 exchange unit across resolutions (downsampling or upsampling).

5. Two width settings are studied:

    HRNet-W32 (the highest-resolution branch has 32 feature maps in the last three stages; the three lower-resolution branches have 64, 128, 256 feature maps);

    HRNet-W48 (the highest-resolution branch has 48 feature maps; the three lower-resolution branches have 96, 192, 384 feature maps).
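The width pattern follows directly from point 5: each halving of resolution doubles the channel count, so from high-resolution width C the four branches have C, 2C, 4C, 8C channels. A one-line check (my own helper, just restating the rule):

```python
# Widths of the four parallel branches given the high-resolution width C:
# halving the resolution doubles the number of channels.

def branch_widths(c, num_branches=4):
    return [c * 2 ** i for i in range(num_branches)]

print(branch_widths(32))  # HRNet-W32 -> [32, 64, 128, 256]
print(branch_widths(48))  # HRNet-W48 -> [48, 96, 192, 384]
```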

 

Experiments

COCO Keypoint Detection

MPII Human Pose Estimation

Application to Pose Tracking

Since I am not working on these applications for now, I have not read the experiments section in detail.

 

Ablation Study

The ablation study analyzes three aspects:

Repeated multi-scale fusion:

(a) W/o intermediate exchange units (1 fusion): There is no exchange between multi-resolution subnetworks except the last exchange unit. 

(b) W/ across-stage exchange units only (3 fusions): There is no exchange between parallel subnetworks within each stage.

(c) W/ both across-stage and within-stage exchange units (totally 8 fusions): This is our proposed method.

Unsurprisingly, the 8-fusion variant works best.

 

Resolution maintenance

All four high-to-low resolution subnetworks are added at the beginning, the depths are the same, and the fusion schemes are the same as ours.

This experiment presumably means: all the high-to-low resolution subnetworks exist from the very beginning, so the network as a whole forms a rectangle rather than the triangle shape of HRNet.

The result shows that although such a network adds many layers, it performs worse, because:

We believe that the reason is that the low-level features extracted from the early stages over the low-resolution subnetworks are less helpful. In addition, the simple high-resolution network of similar parameter and computation complexities without low-resolution parallel subnetworks shows much lower performance.

1. The low-level features extracted in the early stages over the low-resolution subnetworks are less helpful.

2. In addition, a simple high-resolution network with similar parameters and computation but without the low-resolution parallel subnetworks performs much worse, i.e., keeping high resolution alone is not enough; the low-resolution parallel branches are essential.

 

Representation resolution

We study how the representation resolution affects the pose estimation performance from two aspects: check the quality of the heatmap estimated from the feature maps of each resolution from high to low, and study how the input size affects the quality.

Resolution is studied from two aspects:

1. compare the accuracy of heatmaps estimated from the feature maps of each resolution, from high to low;

2. vary the input resolution.
