BERT

References:
https://github.com/xu-song/bert-as-language-model
https://stackoverflow.com/questions/63030692/how-do-i-use-bertformaskedlm-or-bertmodel-to-calculate-perplexity-of-a-sentence
https://github.com/ymcui/Chinese-BERT-wwm
Definition: for a given sentence, mask one token at a time, in order, and compute the NLL loss of the model's prediction for the masked token; sum these losses over all tokens, take the mean, and exponentiate (base e). The result is the sentence's PPL.
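Written as a formula (the notation is mine, not taken from the references above): for a sentence $W = w_1 w_2 \dots w_n$, let $W_{\setminus i}$ be $W$ with the $i$-th token replaced by [MASK]; the score computed here is the masked-LM pseudo-perplexity

$$\mathrm{PPL}(W) = \exp\!\left(-\frac{1}{n}\sum_{i=1}^{n}\log P_{\mathrm{MLM}}\!\left(w_i \mid W_{\setminus i}\right)\right).$$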
A simple test implementation:
```python
import numpy as np
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM

with torch.no_grad():
    model = BertForMaskedLM.from_pretrained('hfl/chinese-bert-wwm-ext')
    model.eval()
    tokenizer = BertTokenizer.from_pretrained('hfl/chinese-bert-wwm-ext')

    sentence = "我不会忘记和你一起奋斗的时光。"
    # tokenizer.tokenize() does not add [CLS]/[SEP]
    tokenize_input = tokenizer.tokenize(sentence)
    tensor_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])
    sen_len = len(tokenize_input)
    sentence_loss = 0.

    for i, word in enumerate(tokenize_input):
        # replace the i-th token with [MASK]
        tokenize_input[i] = '[MASK]'
        mask_input = torch.tensor([tokenizer.convert_tokens_to_ids(tokenize_input)])

        output = model(mask_input)
        prediction_scores = output[0]

        # log-probability of the original token at the masked position
        softmax = nn.Softmax(dim=0)
        ps = softmax(prediction_scores[0, i]).log()
        word_loss = ps[tensor_input[0, i]]
        sentence_loss += word_loss.item()

        # restore the original token before masking the next one
        tokenize_input[i] = word

    # mean NLL over all tokens, then exponentiate
    ppl = np.exp(-sentence_loss / sen_len)
    print(ppl)
```
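One caveat about the loop above: `nn.Softmax(...)(x).log()` can underflow to -inf for very unlikely tokens, while `torch.log_softmax` computes the same quantity directly in log space. A standalone illustration (the vocabulary size 21128 matches the Chinese BERT checkpoints, and the token id is made up):

```python
import torch

logits = torch.randn(21128)  # stand-in for prediction_scores[0, i]
token_id = 512               # stand-in for the original token's id at position i

# numerically stable equivalent of softmax(logits).log()[token_id]
log_prob = torch.log_softmax(logits, dim=-1)[token_id]
print(log_prob.item())
```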
A vectorized (tensor-style) version:
```python
import numpy as np
import torch

# model and tokenizer are the ones loaded in the previous block
def score(model, tokenizer, sentence, mask_token_id=103):
    # [CLS] w_1 ... w_n [SEP]; 103 is [MASK] in the Chinese BERT vocab (tokenizer.mask_token_id)
    tensor_input = tokenizer.encode(sentence, return_tensors='pt')
    # one copy of the sentence per real token (excluding [CLS]/[SEP])
    repeat_input = tensor_input.repeat(tensor_input.size(-1) - 2, 1)
    # shifted diagonal: row i has a single 1, so each row masks exactly one real token
    mask = torch.ones(tensor_input.size(-1) - 1).diag(1)[:-2]
    masked_input = repeat_input.masked_fill(mask == 1, mask_token_id)
    # compute the loss only at the masked positions (-100 is ignored by the loss)
    labels = repeat_input.masked_fill(masked_input != mask_token_id, -100)
    # transformers >= 4: pass labels= and read .loss (the old masked_lm_labels argument was removed)
    loss = model(masked_input, labels=labels).loss
    return np.exp(loss.item())

s = score(model, tokenizer, '我不会忘记和你一起奋斗的时光。')
print(s)
```
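The mask construction is the only cryptic step: `torch.ones(n - 1).diag(1)[:-2]` builds one row per real token, each row masking exactly one position while leaving [CLS] and [SEP] untouched. A toy illustration (the length 6 is arbitrary, standing for [CLS] + 4 tokens + [SEP]):

```python
import torch

n = 6  # length of tokenizer.encode(...) including [CLS] and [SEP]
mask = torch.ones(n - 1).diag(1)[:-2]
print(mask)
# tensor([[0., 1., 0., 0., 0., 0.],
#         [0., 0., 1., 0., 0., 0.],
#         [0., 0., 0., 1., 0., 0.],
#         [0., 0., 0., 0., 1., 0.]])
# each row has a single 1, covering columns 1..4 (the real tokens);
# column 0 ([CLS]) and column 5 ([SEP]) are never masked
```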
GPT-2

Reference:
https://github.com/Morizeyao/GPT2-Chinese
The official GPT-2 does not support Chinese and uses BPE tokenization. For Chinese, members of the NLP community have trained Chinese GPT-2 models, which use a BERT tokenizer for tokenization.
For a given sentence of length n, shift it left by one position to form the labels and drop the last token to form the input, compute the cross-entropy loss between GPT-2's output and the labels, then exponentiate (base e) to obtain the sentence's PPL.
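As a formula (again, notation mine), this is the usual autoregressive perplexity over the $n-1$ predicted positions:

$$\mathrm{PPL}(W) = \exp\!\left(-\frac{1}{n-1}\sum_{i=2}^{n}\log P\!\left(w_i \mid w_1,\dots,w_{i-1}\right)\right).$$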
```python
import torch
from transformers import BertTokenizer, GPT2LMHeadModel
from torch.nn import CrossEntropyLoss


def cal_ppl_bygpt2():
    sens = ["今天是个好日子。",
            "天今子日。个是好",
            "这个婴儿有900000克呢。",
            "我不会忘记和你一起奋斗的时光。",
            "我不会记忘和你一起奋斗的时光。",
            "会我记忘和你斗起一奋的时光。"]
    tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
    model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall")

    inputs = tokenizer(sens, padding='max_length', max_length=50,
                       truncation=True, return_tensors="pt")
    bs, sl = inputs['input_ids'].size()
    outputs = model(**inputs, labels=inputs['input_ids'])
    logits = outputs[1]

    # shift so that position i predicts token i+1
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = inputs['input_ids'][:, 1:].contiguous()
    shift_attentions = inputs['attention_mask'][:, 1:].contiguous()

    # per-token loss; [PAD] (id 0) positions are ignored
    loss_fct = CrossEntropyLoss(ignore_index=0, reduction="none")
    loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)),
                    shift_labels.view(-1)).detach().reshape(bs, -1)

    # average over the real (non-padding) positions of each sentence, then exponentiate
    meanloss = loss.sum(1) / shift_attentions.sum(1)
    ppl = torch.exp(meanloss).numpy().tolist()
    return ppl


if __name__ == '__main__':
    print(cal_ppl_bygpt2())
```
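For a single sentence, the padding and manual shifting are unnecessary: GPT2LMHeadModel already shifts the labels internally and returns the mean token-level cross-entropy. A minimal sketch under that assumption (reusing the same checkpoint; this is an illustration, not part of the original script):

```python
import torch
from transformers import BertTokenizer, GPT2LMHeadModel

tokenizer = BertTokenizer.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
model = GPT2LMHeadModel.from_pretrained("uer/gpt2-chinese-cluecorpussmall")
model.eval()

with torch.no_grad():
    inputs = tokenizer("我不会忘记和你一起奋斗的时光。", return_tensors="pt")
    # labels == input_ids: the model shifts them internally and averages the cross-entropy
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    # note: with a BERT-style tokenizer, [SEP] is also counted as a prediction target
    print(torch.exp(loss).item())
```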