Similarities and Differences Between ALBERT and BERT



nathan_deep · Published 2020-01-21 15:59:15

Paper: https://openreview.net/pdf?id=H1eA7AEtvS

Chinese pre-trained ALBERT models: https://github.com/brightmart/albert_zh

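1. Factorized embedding parameterization

In BERT, the word embedding size $E$ and the encoder hidden size $H$ are identical (768 in BERT-base). ALBERT argues that word-level embeddings are context-independent representations, whereas hidden-layer outputs encode both the meaning of the word itself and contextual information. In principle the hidden representations should therefore carry more information, which suggests $H\gg E$. This is why ALBERT uses a word-embedding dimension smaller than the encoder output dimension.

In NLP tasks the vocabulary is usually large, and the embedding matrix has size $V\times E$, where $V$ is the vocabulary size. If $H=E$ as in BERT, the embedding matrix holds a very large number of parameters, and its updates during backpropagation are sparse.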

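Combining these two points, ALBERT uses a factorization to reduce the parameter count. The one-hot vector is first mapped into a low-dimensional space of size $E$ and then mapped into the high-dimensional hidden space. In other words, the input first passes through a small embedding matrix and then through a projection matrix that brings it into the hidden-layer space, which reduces the parameter count from $O(V\times H)$ to $O(V\times E+E\times H)$. The reduction is significant when $E\ll H$.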
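
To make the saving concrete, here is a small arithmetic sketch of the two parameter counts. The sizes $V=30000$, $H=768$ and $E=128$ are illustrative assumptions, and `embedding_params` is a hypothetical helper, not a function from the repository.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters with or without factorization."""
    if embedding_size is None:
        # BERT-style: a single V x H table
        return vocab_size * hidden_size
    # ALBERT-style: a V x E table followed by an E x H projection
    return vocab_size * embedding_size + embedding_size * hidden_size


V, H, E = 30000, 768, 128  # illustrative sizes (assumed)
unfactorized = embedding_params(V, H)
factorized = embedding_params(V, H, E)

print(unfactorized)  # 23040000
print(factorized)    # 3938304
```

With these assumed sizes the factorized version needs roughly one sixth of the embedding parameters.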

In modeling.py, the TensorFlow code for the factorized embedding is as follows:

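# Factorized embedding parameterization provided by ALBERT.
# create_initializer and get_shape_list are helpers defined elsewhere in modeling.py.
import tensorflow as tf  # TensorFlow 1.x API (tf.get_variable)


def embedding_lookup_factorized(input_ids,
                     vocab_size,
                     hidden_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
    """Look up word embeddings in a low-dimensional table, then project them to hidden_size.

    :param input_ids: int Tensor of token ids, shape [batch_size, seq_length]
    :param vocab_size: vocabulary size (V)
    :param hidden_size: encoder hidden size (H)
    :param embedding_size: size of the low-dimensional word embedding (E)
    :param initializer_range: value passed to create_initializer for both matrices
    :param word_embedding_name: variable name of the embedding table
    :param use_one_hot_embeddings: if True, look up via one-hot matmul; otherwise use tf.gather
    :return: (output of shape [batch_size, seq_length, hidden_size], embedding_table, project_variable)
    """
    # 1. Map the one-hot vectors into a low-dimensional dense space of size embedding_size
    print("embedding_lookup_factorized. factorized embedding parameterization is used.")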
    if input_ids.shape.ndims == 2:
        input_ids = tf.expand_dims(input_ids, axis=[-1])  # shape of input_ids is:[ batch_size, seq_length, 1]

    embedding_table = tf.get_variable(  # [vocab_size, embedding_size]
        name=word_embedding_name,
        shape=[vocab_size, embedding_size],
        initializer=create_initializer(initializer_range))

    flat_input_ids = tf.reshape(input_ids, [-1])  # one rank. shape as (batch_size * sequence_length,)
    if use_one_hot_embeddings:
        one_hot_input_ids = tf.one_hot(flat_input_ids,depth=vocab_size)  
        output_middle = tf.matmul(one_hot_input_ids, embedding_table)  # [batch_size * sequence_length,embedding_size]
    else:
        output_middle = tf.gather(embedding_table,flat_input_ids)  # [batch_size * sequence_length,embedding_size]

    # 2. Project the output of step 1 into the hidden_size vector space
    project_variable = tf.get_variable(  # [embedding_size, hidden_size]
        name=word_embedding_name+"_2",
        shape=[embedding_size, hidden_size],
        initializer=create_initializer(initializer_range))
    output = tf.matmul(output_middle, project_variable) # [batch_size * sequence_length, hidden_size]
    # reshape back to 3 rank
    input_shape = get_shape_list(input_ids)
    batch_size, sequence_length, _ = input_shape
    output = tf.reshape(output, (batch_size, sequence_length, hidden_size))  # [batch_size, sequence_length, hidden_size]
    return (output, embedding_table, project_variable)
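
The two lookup paths in the function (one-hot matmul versus gather) compute the same thing. The following is a framework-free sketch of that idea in plain Python, under toy sizes chosen only for illustration. `matmul` and `factorized_lookup` are hypothetical helpers and do not reproduce the TensorFlow function.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def factorized_lookup(token_ids, embedding_table, projection, use_one_hot=False):
    """Map token ids to hidden vectors via a V x E table and an E x H projection."""
    vocab_size = len(embedding_table)
    if use_one_hot:
        # One-hot rows times the table select the same rows as a gather
        one_hot = [[1.0 if j == t else 0.0 for j in range(vocab_size)] for t in token_ids]
        middle = matmul(one_hot, embedding_table)         # [num_tokens, E]
    else:
        middle = [embedding_table[t] for t in token_ids]  # [num_tokens, E]
    return matmul(middle, projection)                     # [num_tokens, H]


rng = random.Random(0)
V, E, H = 10, 3, 6  # toy sizes (assumed)
embedding_table = [[rng.uniform(-1, 1) for _ in range(E)] for _ in range(V)]
projection = [[rng.uniform(-1, 1) for _ in range(H)] for _ in range(E)]

token_ids = [4, 0, 9, 4]
gathered = factorized_lookup(token_ids, embedding_table, projection)
via_one_hot = factorized_lookup(token_ids, embedding_table, projection, use_one_hot=True)

print(len(gathered), len(gathered[0]))  # 4 6
```

Both paths return one `H`-dimensional vector per token, and repeated token ids map to identical vectors because no context is involved at this stage.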

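2. Cross-layer parameter sharing

ALBERT also proposes a parameter-sharing scheme. A Transformer can share parameters across layers in several ways: sharing only the feed-forward (fully connected) layers, or sharing only the attention layers. ALBERT combines both and shares the feed-forward and attention parameters alike, that is, all parameters inside the encoder layers. For a Transformer of the same size, this scheme actually lowers accuracy somewhat, but it cuts the parameter count substantially and also speeds up training considerably.

In modeling.py, cross-layer sharing is implemented by setting reuse=True in variable_scope.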
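
As a rough illustration of how much each sharing strategy saves, the sketch below counts encoder parameters for a BERT-base-like shape (12 layers, hidden size 768, intermediate size 3072). The split of the LayerNorm parameters between the attention and feed-forward parts is an assumption made for this estimate, and both helper functions are hypothetical.

```python
def layer_params(hidden_size, intermediate_size):
    """Approximate parameter counts of one encoder layer: (attention, feed-forward)."""
    # Q, K, V and output projections (weights + biases) plus one LayerNorm (gamma + beta)
    attention = 4 * (hidden_size * hidden_size + hidden_size) + 2 * hidden_size
    # Two dense layers (weights + biases) plus one LayerNorm (gamma + beta)
    ffn = (hidden_size * intermediate_size + intermediate_size
           + intermediate_size * hidden_size + hidden_size
           + 2 * hidden_size)
    return attention, ffn


def encoder_params(num_layers, hidden_size, intermediate_size, shared="all"):
    """Total encoder parameters for shared in {'all', 'attention', 'ffn', None}."""
    attention, ffn = layer_params(hidden_size, intermediate_size)
    attention_copies = 1 if shared in ("all", "attention") else num_layers
    ffn_copies = 1 if shared in ("all", "ffn") else num_layers
    return attention_copies * attention + ffn_copies * ffn


for mode in (None, "attention", "ffn", "all"):
    print(mode, encoder_params(12, 768, 3072, shared=mode))
```

Under these assumptions, sharing everything leaves a single layer's worth of parameters, one twelfth of the unshared encoder.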

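In the original Transformer, Layer Norm follows the residual connection; we call this the Post-LN Transformer. The Post-LN Transformer is very sensitive to hyperparameters and needs careful tuning to reach good results, for example the indispensable learning-rate warm-up, which is very time-consuming.

Since warm-up is applied in the initial stage of training, the optimization problem must lie in that initial stage, including model initialization.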

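In the initial stage of training a Post-LN Transformer, the expected gradients near the output layer are very large, so without warm-up the optimization blows up and is highly unstable. Moving LayerNorm to a different position, for example inside the residual branch (called the Pre-LN Transformer), and observing the gradients at the start of training again shows much better behavior than Post-LN. Warm-up may not even be needed, which further reduces training time.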

Reference: On Layer Normalization in the Transformer Architecture
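
The difference between the two layouts comes down to where the normalization sits relative to the residual addition. The following minimal sketch shows this on a single vector. `layer_norm` here has no learned scale or shift, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math


def layer_norm(x, eps=1e-12):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Post-LN: LayerNorm is applied after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre-LN: LayerNorm is applied to the sublayer input; the residual path stays untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward network
    return [0.5 * v for v in x]


x = [1.0, 2.0, 4.0, 9.0]
post = post_ln_block(x, toy_sublayer)
pre = pre_ln_block(x, toy_sublayer)
print(post)
print(pre)
```

In the Pre-LN form the input `x` reaches the output through an identity path, whereas in the Post-LN form every block output is re-normalized.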

def prelln_transformer_model(input_tensor,
						attention_mask=None,
						hidden_size=768,
						num_hidden_layers=12,
						num_attention_heads=12,
						intermediate_size=3072,
						intermediate_act_fn=gelu,
						hidden_dropout_prob=0.1,
						attention_probs_dropout_prob=0.1,
						initializer_range=0.02,
						do_return_all_layers=False,
						shared_type='all', # None,
						adapter_fn=None):
	
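	# The names below are used later but were missing from the listing. They are
	# restored following BERT's transformer_model, assuming bert_utils provides
	# get_shape_list as BERT's modeling.py does.
	attention_head_size = int(hidden_size / num_attention_heads)
	input_shape = bert_utils.get_shape_list(input_tensor, expected_rank=3)
	batch_size = input_shape[0]
	seq_length = input_shape[1]

	prev_output = bert_utils.reshape_to_matrix(input_tensor)

	all_layer_outputs = []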

	def layer_scope(idx, shared_type):
		if shared_type == 'all':
			tmp = {
				"layer":"layer_shared",
				'attention':'attention',
				'intermediate':'intermediate',
				'output':'output'
			}
		elif shared_type == 'attention':
			tmp = {
				"layer":"layer_shared",
				'attention':'attention',
				'intermediate':'intermediate_{}'.format(idx),
				'output':'output_{}'.format(idx)
			}
		elif shared_type == 'ffn':
			tmp = {
				"layer":"layer_shared",
				'attention':'attention_{}'.format(idx),
				'intermediate':'intermediate',
				'output':'output'
			}
		else:
			tmp = {
				"layer":"layer_{}".format(idx),
				'attention':'attention',
				'intermediate':'intermediate',
				'output':'output'
			}

		return tmp

	for layer_idx in range(num_hidden_layers):

		idx_scope = layer_scope(layer_idx, shared_type)
		# Cross-layer parameter sharing: the layer scope name decides which layers reuse variables
		with tf.variable_scope(idx_scope['layer'], reuse=tf.AUTO_REUSE):
			layer_input = prev_output
			# Share the attention-layer parameters
			with tf.variable_scope(idx_scope['attention'], reuse=tf.AUTO_REUSE):
				attention_heads = []
				# Pre-LN: normalize the layer input before self-attention
				with tf.variable_scope("output", reuse=tf.AUTO_REUSE):
					layer_input_pre = layer_norm(layer_input)

				with tf.variable_scope("self"):
					attention_head = attention_layer(
							from_tensor=layer_input_pre,
							to_tensor=layer_input_pre,
							attention_mask=attention_mask,
							num_attention_heads=num_attention_heads,
							size_per_head=attention_head_size,
							attention_probs_dropout_prob=attention_probs_dropout_prob,
							initializer_range=initializer_range,
							do_return_2d_tensor=True,
							batch_size=batch_size,
							from_seq_length=seq_length,
							to_seq_length=seq_length)
					attention_heads.append(attention_head)

				attention_output = None
				if len(attention_heads) == 1:
					attention_output = attention_heads[0]
				else:
					# In the case where we have other sequences, we just concatenate
					# them to the self-attention head before the projection.
					attention_output = tf.concat(attention_heads, axis=-1)

				# Run a linear projection of `hidden_size` then add a residual
				# with `layer_input`.
				# Share the parameters of the attention output projection
				with tf.variable_scope("output", reuse=tf.AUTO_REUSE):
					attention_output = tf.layers.dense(
							attention_output,
							hidden_size,
							kernel_initializer=create_initializer(initializer_range))
					attention_output = dropout(attention_output, hidden_dropout_prob)

					# attention_output = layer_norm(attention_output + layer_input)
					attention_output = attention_output + layer_input
			# Pre-LN: normalize before the feed-forward network
			with tf.variable_scope(idx_scope['output'], reuse=tf.AUTO_REUSE):
				attention_output_pre = layer_norm(attention_output)

			# Share the feed-forward (intermediate) parameters
			with tf.variable_scope(idx_scope['intermediate'], reuse=tf.AUTO_REUSE):
				intermediate_output = tf.layers.dense(
						attention_output_pre,
						intermediate_size,
						activation=intermediate_act_fn,
						kernel_initializer=create_initializer(initializer_range))

			# Share the feed-forward output parameters
			with tf.variable_scope(idx_scope['output'], reuse=tf.AUTO_REUSE):
				layer_output = tf.layers.dense(
						intermediate_output,
						hidden_size,
						kernel_initializer=create_initializer(initializer_range))
				layer_output = dropout(layer_output, hidden_dropout_prob)

				# layer_output = layer_norm(layer_output + attention_output)
				layer_output = layer_output + attention_output
				prev_output = layer_output
				all_layer_outputs.append(layer_output)

	if do_return_all_layers:
		final_outputs = []
		for layer_output in all_layer_outputs:
			final_output = bert_utils.reshape_from_matrix(layer_output, input_shape)
			final_outputs.append(final_output)
		return final_outputs
	else:
		final_output = bert_utils.reshape_from_matrix(prev_output, input_shape)
		return final_output
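
Sharing in the listing above works purely through scope names: with `reuse=tf.AUTO_REUSE`, a variable requested under a scope name that already exists is reused instead of created again. The sketch below mimics only that naming logic in plain Python, without TensorFlow. `layer_scope` mirrors the helper in the listing, while `distinct_scopes` is a hypothetical function added for illustration.

```python
def layer_scope(idx, shared_type):
    """Scope names for layer idx, mirroring the helper in the listing."""
    if shared_type == 'all':
        return {'layer': 'layer_shared', 'attention': 'attention',
                'intermediate': 'intermediate', 'output': 'output'}
    if shared_type == 'attention':
        return {'layer': 'layer_shared', 'attention': 'attention',
                'intermediate': 'intermediate_{}'.format(idx),
                'output': 'output_{}'.format(idx)}
    if shared_type == 'ffn':
        return {'layer': 'layer_shared', 'attention': 'attention_{}'.format(idx),
                'intermediate': 'intermediate', 'output': 'output'}
    return {'layer': 'layer_{}'.format(idx), 'attention': 'attention',
            'intermediate': 'intermediate', 'output': 'output'}


def distinct_scopes(num_layers, shared_type):
    """Collect the distinct fully qualified scope names opened across all layers."""
    scopes = set()
    for idx in range(num_layers):
        names = layer_scope(idx, shared_type)
        for part in ('attention', 'intermediate', 'output'):
            scopes.add(names['layer'] + '/' + names[part])
    return scopes


for mode in ('all', 'attention', 'ffn', None):
    print(mode, len(distinct_scopes(12, mode)))
```

Fewer distinct scope names means fewer independent sets of variables: with `shared_type='all'` all 12 layers resolve to the same three scopes.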

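3. Inter-sentence coherence loss

BERT's NSP task is in effect a binary classification. Positive training examples are two consecutive sentences sampled from the same document, while negative examples pair sentences from two different documents.

To keep only the coherence task and remove the influence of topic recognition, ALBERT proposes a new task, sentence-order prediction (SOP).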

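NSP (Next Sentence Prediction): positive example = two adjacent sentences; negative example = two random sentences
SOP (Sentence Order Prediction): positive example = two adjacent sentences in their original order; negative example = the same two adjacent sentences with their order swapped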

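For tasks such as natural language inference (NLI), studies have found that NSP does not help much, mainly because the task is too easy. NSP actually bundles two subtasks, topic prediction and coherence prediction, and topic prediction is far easier than coherence prediction: the model only needs to notice that the two sentences are about different topics. SOP lets the model learn more. Because both sentences in an SOP example come from the same document, the task depends only on sentence order and is unaffected by topic.

In the create_instances_from_document_albert function of create_pretraining_data.py, negative examples are built by swapping the order of two sentences that appear in normal order.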
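
The contrast between the two tasks can be sketched with two small sampling functions. This is an illustration only, not the repository's create_instances_from_document_albert: the function names, the 50/50 sampling and the label convention (1 = positive, 0 = negative) are assumptions.

```python
import random


def make_sop_pair(segment_a, segment_b, rng):
    """SOP: return (first, second, label); 1 = original order, 0 = swapped order."""
    if rng.random() < 0.5:
        return segment_a, segment_b, 1
    # Negative example: the same two adjacent segments, order swapped
    return segment_b, segment_a, 0


def make_nsp_pair(segment_a, segment_b, other_document, rng):
    """NSP: return (first, second, label); 1 = true next segment, 0 = random segment."""
    if rng.random() < 0.5:
        return segment_a, segment_b, 1
    # Negative example: second segment drawn from a different document
    return segment_a, rng.choice(other_document), 0


rng = random.Random(0)
document = ["The model is pre-trained first.", "It is then fine-tuned on the task."]
other_document = ["The weather was sunny.", "We went hiking."]

print(make_sop_pair(document[0], document[1], rng))
print(make_nsp_pair(document[0], document[1], other_document, rng))
```

An SOP negative always stays within one document, so topic cues cannot separate it from the positive. An NSP negative usually changes topic, which makes it easy to spot.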
