Language Model based on BERT

BERT

In October 2018 the Google AI team released BERT, which set new state-of-the-art results on 11 NLP benchmark tasks and attracted enormous attention. Over the previous two years, machine reading comprehension had been one of the most closely watched and fastest-moving areas of NLP, and the SQuAD dataset released by Stanford University in 2016 played a major role in driving progress in machine comprehension. During the SQuAD 1.0 era Google stayed on the sidelines: Microsoft held the top of the leaderboard for a long stretch, and Alibaba also briefly reached first place. On January 3, 2018, the R-NET model submitted by Microsoft Research Asia led with an EM score of 82.650 (Exact Match, meaning the predicted answer matches the ground-truth answer exactly), the first to surpass the human score of 82.304. Once Google finally entered the race, however, the picture changed: the SQuAD leaderboard is now dominated by BERT, and nearly all of the top-ranked models are BERT-based. For an introduction to general-purpose language models, see the translated blog post and Zhang Junlin's articles linked in the appendix at the end of this post.

Source Code Analysis

Google has open-sourced the code:

https://github.com/google-research/bert

In the repository, create_pretraining_data.py builds the training data and run_pretraining.py runs pre-training. Google also provides second-stage fine-tuning code that can be used directly: run_classifier.py for sentence classification tasks and run_squad.py for machine reading comprehension. A BERT-based language model can be obtained by adapting the pre-trained model directly; see:

https://github.com/xu-song/bert-as-language-model

The author's main change is to the get_masked_lm_output function: the masked LM loss is computed without masked_lm_weights, and the per-position loss is returned directly instead of a single averaged scalar. Reference code below:

# Original code
def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids, label_weights):
  """Get loss and log probs for the masked LM."""
  input_tensor = gather_indexes(input_tensor, positions)

  with tf.variable_scope("cls/predictions"):
    # We apply one more non-linear transformation before the output layer.
    # This matrix is not used after pre-training.
    with tf.variable_scope("transform"):
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = modeling.layer_norm(input_tensor)

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    label_ids = tf.reshape(label_ids, [-1])
    label_weights = tf.reshape(label_weights, [-1])

    one_hot_labels = tf.one_hot(
        label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

    # The `positions` tensor might be zero-padded (if the sequence is too
    # short to have the maximum number of predictions). The `label_weights`
    # tensor has a value of 1.0 for every real prediction and 0.0 for the
    # padding predictions.
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    numerator = tf.reduce_sum(label_weights * per_example_loss)
    denominator = tf.reduce_sum(label_weights) + 1e-5
    loss = numerator / denominator

  return (loss, per_example_loss, log_probs)

# Modified code
def get_masked_lm_output(bert_config, input_tensor, output_weights, positions,
                         label_ids):
  """Get loss and log probs for the masked LM."""
  input_tensor = gather_indexes(input_tensor, positions)

  with tf.variable_scope("cls/predictions"):
    # We apply one more non-linear transformation before the output layer.
    # This matrix is not used after pre-training.
    with tf.variable_scope("transform"):
      input_tensor = tf.layers.dense(
          input_tensor,
          units=bert_config.hidden_size,
          activation=modeling.get_activation(bert_config.hidden_act),
          kernel_initializer=modeling.create_initializer(
              bert_config.initializer_range))
      input_tensor = modeling.layer_norm(input_tensor)

    # The output weights are the same as the input embeddings, but there is
    # an output-only bias for each token.
    output_bias = tf.get_variable(
        "output_bias",
        shape=[bert_config.vocab_size],
        initializer=tf.zeros_initializer())
    logits = tf.matmul(input_tensor, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    label_ids = tf.reshape(label_ids, [-1])

    one_hot_labels = tf.one_hot(
        label_ids, depth=bert_config.vocab_size, dtype=tf.float32)

    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    loss = tf.reshape(per_example_loss, [-1, tf.shape(positions)[1]])
    # TODO: dynamic gather from per_example_loss

  return loss

In Python, the inputs can be constructed directly and the predictions obtained through TensorFlow's high-level Estimator API:

result = estimator.predict(input_fn=predict_input_fn)
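For context, a minimal sketch of what a predict_input_fn could look like is shown below. The feature names mirror the placeholders exported later; the toy zero-valued arrays and the max_seq_length / max_predictions_per_seq values are assumptions for illustration only, not the actual input pipeline of bert-as-language-model.

import numpy as np
import tensorflow as tf

max_seq_length = 128          # assumed value, for illustration only
max_predictions_per_seq = 20  # assumed value, for illustration only

# One toy example; in practice these arrays come from the tokenized sentences,
# with one masked copy of the sentence per token whose probability is wanted.
features = {
    "input_ids": np.zeros([1, max_seq_length], dtype=np.int32),
    "input_mask": np.zeros([1, max_seq_length], dtype=np.int32),
    "segment_ids": np.zeros([1, max_seq_length], dtype=np.int32),
    "masked_lm_positions": np.zeros([1, max_predictions_per_seq], dtype=np.int32),
    "masked_lm_ids": np.zeros([1, max_predictions_per_seq], dtype=np.int32),
    "masked_lm_weights": np.ones([1, max_predictions_per_seq], dtype=np.float32),
    "next_sentence_labels": np.zeros([1, 1], dtype=np.int32),
}

def predict_input_fn(params):
    # TPUEstimator passes the batch size through `params`.
    dataset = tf.data.Dataset.from_tensor_slices(features)
    return dataset.batch(params["batch_size"], drop_remainder=False)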

The values returned by estimator.predict are specified in model_fn_builder:

if mode == tf.estimator.ModeKeys.PREDICT:
  output_spec = tf.contrib.tpu.TPUEstimatorSpec(
      mode=mode, predictions=masked_lm_example_loss,
      scaffold_fn=scaffold_fn)  # output the score of the masked word

One inconvenience of using BERT as a language model is that the probability of each token has to be computed one at a time, and the sentence perplexity (ppl) is then computed from these probabilities.

ppl (perplexity): a standard metric in NLP for evaluating language models. It estimates the probability of a sentence from its per-word probabilities and normalizes by sentence length; the lower the ppl, the more plausible the sentence.
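Concretely, with the per-token losses (negative log probabilities) returned by the model, the perplexity is just the exponential of their average; the numbers below are purely illustrative:

import numpy as np

# token_losses[i] = -log p(token_i | the rest of the sentence), one entry per token
token_losses = [2.31, 0.47, 1.05]           # illustrative values only
ppl = float(np.exp(np.mean(token_losses)))  # lower ppl = more plausible sentence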

Code for parsing the results and computing ppl:

def parse_result(result, all_tokens, output_file=None):
  with tf.gfile.GFile(output_file, "w") as writer:
    tf.logging.info("***** Predict results *****")
    i = 0
    sentences = []
    for word_loss in result:
      # start of a sentence
      if all_tokens[i] == "[CLS]":
        sentence = {}
        tokens = []
        sentence_loss = 0.0
        word_count_per_sent = 0
        i += 1

      # add token
      tokens.append({"token": tokenization.printable_text(all_tokens[i]),
                     "prob": float(np.exp(-word_loss[0]))})
      sentence_loss += word_loss[0]
      word_count_per_sent += 1
      i += 1

      token_count_per_word = 0
      while is_subtoken(all_tokens[i]):
        token_count_per_word += 1
        tokens.append({"token": tokenization.printable_text(all_tokens[i]),
                       "prob": float(np.exp(-word_loss[token_count_per_word]))})
        sentence_loss += word_loss[token_count_per_word]
        i += 1

      # end of a sentence
      if all_tokens[i] == "[SEP]":
        sentence["tokens"] = tokens
        sentence["ppl"] = float(np.exp(sentence_loss / word_count_per_sent))
        sentences.append(sentence)
        i += 1

    if output_file is not None:
      tf.logging.info("Saving results to %s" % output_file)
      writer.write(json.dumps(sentences, indent=2, ensure_ascii=False))
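A hypothetical call, assuming result is the generator returned by estimator.predict, all_tokens is the flat token list fed to the model, and the output file name is chosen arbitrarily:

parse_result(result, all_tokens, output_file="lm_output.json")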

Model Training, Export, and Deployment

Because the masked LM loss node in the pre-trained model is unnamed, after giving it a name a very short round of pre-training has to be run so that the model can be exported with the named node (a sketch of the naming step follows the export code below). The get_masked_lm_output function is adapted along the lines of the bert-as-language-model code. Since newer TensorFlow versions feed input through Estimators, our beloved placeholders are nowhere to be found, yet placeholders are still needed to receive input when deploying the model. They can be added when exporting the model in run_pretraining.py with the following code:

if FLAGS.do_export:
  estimator._export_to_tpu = False
  name_to_features = [("input_ids", tf.int32), ("input_mask", tf.int32),
                      ("segment_ids", tf.int32), ("masked_lm_positions", tf.int32),
                      ("masked_lm_ids", tf.int32), ("masked_lm_weights", tf.float32),
                      ("next_sentence_labels", tf.int32)]
  feature_placeholders = {name: tf.placeholder(dtype, [1, FLAGS.max_seq_length],
                                               name='bert/' + name + "_placeholder")
                          for name, dtype in name_to_features}
  serving_input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn(feature_placeholders)
  path = estimator.export_savedmodel("./export/", serving_input_fn)
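As for the loss-node naming mentioned above, one possible approach is sketched below: wrap the loss returned by the modified get_masked_lm_output in tf.identity before returning it. The name lm_loss is an assumption chosen so that the exported graph exposes cls/predictions/lm_loss, the node fetched later by the Go program; it is not code taken from either repository.

    # Tail of the modified get_masked_lm_output, inside tf.variable_scope("cls/predictions"):
    per_example_loss = -tf.reduce_sum(log_probs * one_hot_labels, axis=[-1])
    loss = tf.reshape(per_example_loss, [-1, tf.shape(positions)[1]])
    # Name the op explicitly so it can be fetched from the exported SavedModel.
    loss = tf.identity(loss, name="lm_loss")  # op name: "cls/predictions/lm_loss"
  return loss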

In this post the model is deployed in Go, using tfgo to load the model and evaluate the corresponding TensorFlow nodes.

// Load the model
model := tg.LoadModel(*modleDir, []string{"serve"}, nil)

Following the export code in run_pretraining.py, the Go program needs to construct seven inputs; masked_lm_weights and next_sentence_labels have no effect on the language model and can be filled in however you like. The example below illustrates the input construction convention (a Python sketch of this construction follows the example):

// Input sentence
何异浮云过太空
// Tokenized sentence
[[CLS] 何 异 浮 云 过 太 空 [SEP]]
------------------------ token 1 -----------------------------
// input_ids
[101 103 2460 3859 756 6814 1922 4958 102 0 0 0 ...]
// input_mask
[1 1 1 1 1 1 1 1 1 0 0 0 ...]
// segment_ids
[0 0 0 ...]
// masked_lm_positions
[1 0 0 0 ...]
// masked_lm_ids
[862 0 0 0 0 0 ...]
------------------------ token 2 -----------------------------
[101 862 103 3859 756 6814 1922 4958 102 0 0 0 ...]
[1 1 1 1 1 1 1 1 1 0 0 0 ...]
[0 0 0 ...]
[2 0 0 0 ...]
[2460 0 0 0 ...]
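To make the construction above concrete, here is a minimal Python sketch that builds one masked instance per token. The helper name build_inputs and the padding length of 128 are assumptions for illustration; the id constants are the ones visible in the example ([CLS]=101, [SEP]=102, [MASK]=103).

MASK_ID, MAX_SEQ_LENGTH = 103, 128   # assumed padding length, matching the export placeholders

def build_inputs(token_ids):
    """token_ids: sentence ids already wrapped with [CLS] ... [SEP]."""
    instances = []
    for pos in range(1, len(token_ids) - 1):   # skip [CLS] and [SEP]
        input_ids = list(token_ids)
        input_ids[pos] = MASK_ID               # mask the current token
        pad = MAX_SEQ_LENGTH - len(token_ids)
        instances.append({
            "input_ids": input_ids + [0] * pad,
            "input_mask": [1] * len(token_ids) + [0] * pad,
            "segment_ids": [0] * MAX_SEQ_LENGTH,
            # positions/ids padded to the same length as the exported placeholders
            "masked_lm_positions": [pos] + [0] * (MAX_SEQ_LENGTH - 1),
            "masked_lm_ids": [token_ids[pos]] + [0] * (MAX_SEQ_LENGTH - 1),
        })
    return instances

# Ids for "[CLS] 何 异 浮 云 过 太 空 [SEP]" from the example above
print(build_inputs([101, 862, 2460, 3859, 756, 6814, 1922, 4958, 102])[0])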

The Go program then computes the sentence ppl:

for i := 1; i < ids_len-1; i++ {
    ...
    result := model.Exec([]tf.Output{
        model.Op("cls/predictions/lm_loss", 0),
    }, map[tf.Output]*tf.Tensor{
        model.Op("bert/input_ids_placeholder", 0):            inputX1,
        model.Op("bert/input_mask_placeholder", 0):           inputX2,
        model.Op("bert/segment_ids_placeholder", 0):          inputX3,
        model.Op("bert/masked_lm_positions_placeholder", 0):  inputX4,
        model.Op("bert/masked_lm_ids_placeholder", 0):        inputX5,
        model.Op("bert/masked_lm_weights_placeholder", 0):    inputX6,
        model.Op("bert/next_sentence_labels_placeholder", 0): inputX7,
    })
    val := result[0].Value().([][]float32)[0][0]
    sentence_loss += float64(val)
    ...
}
ppl := math.Pow(math.E, sentence_loss/float64(ids_len))

Appendix

从Word Embedding到Bert模型 (From Word Embedding to the BERT Model)
效果惊人的GPT 2.0模型 (The Astonishing GPT-2.0 Model)
通用语言模型 (General-Purpose Language Models)