[LLM] Your Highness, stop using jieba for word segmentation! Look at the high tech ChatGLM next door is using!

小殊小殊 2024-04-11 09:51:01 Category: LLM

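© Copyright belongs to the author: this is an original work by 小殊小殊 on the 51CTO blog. Please contact the author for reprint authorization; otherwise legal liability will be pursued.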

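I. Introduction
II. Running the Program
III. Vocabulary
1. Generating the vocabulary
2. Special tokens
IV. Encoding Process
1. Removing spaces and lowercasing
2. Converting newlines, tabs, and spaces
3. Virtual space
4. Generating token_id
5. Appending special tokens
V. Decoding Process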

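I. Introduction

ChatGLM is an excellent open-source large language model from China, and a lot of people are studying it. To use it for your own tasks you still need to understand how it works, and there are quite a few details. ChatGLM has gone through several versions, so I'll start my notes with the first version of the code. Each later version modifies the previous one rather than changing everything, so when a new version comes out you only need to look at what changed.

There is certainly a lot to cover with large models, so let's start with something that comes early in the pipeline: the Tokenizer.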

II. Running the Program

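First, download the ChatGLM project, using whichever download channel is most stable for you.

ChatGLM-6B: https://github.com/THUDM/ChatGLM-6B

Model files: https://huggingface.co/THUDM/chatglm-6b/tree/main

Once the download finishes, put the model files in THUDM/chatglm-6b under the project directory. If the code below prints a result, the program is working: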

from transformers import AutoTokenizer


if __name__ == "__main__":
    # Local directory that holds the downloaded model files
    model_name = "THUDM/chatglm-6b"
    # trust_remote_code=True lets transformers load tokenization_chatglm.py
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    text = "我爱学习"
    tokens = tokenizer.encode(text)
    print("tokens:", tokens)
    ''' Output:
    tokens: [5, 76202, 63992, 130001, 130004]
    '''

Now let's look at the model files. Three of them relate to the Tokenizer, as shown in the figure below:

[Figure: the Tokenizer-related files in the model directory]

ice_text.model: the parameter file of the tokenization model;

tokenization_chatglm.py: implements the tokenization logic;

tokenizer_config.json: the tokenizer's configuration file.

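III. Vocabulary

1. Generating the vocabulary

The code below lets us inspect the size of the vocabulary. Running it dumps the complete vocabulary into the file vocab.txt: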

import sentencepiece as spm

# Load the SentencePiece model that ships with ChatGLM-6B
sp = spm.SentencePieceProcessor()
sp.load('THUDM/chatglm-6b/ice_text.model')

# Build one "id<TAB>piece" line per vocabulary entry
save_vocab = []
for token_id in range(sp.vocab_size()):
    piece = sp.id_to_piece(token_id)
    save_vocab.append(str(token_id) + "\t" + piece)
    print(piece)

with open("vocab.txt", 'w', encoding='utf-8') as f:
    f.write('\n'.join(save_vocab))

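The vocab.txt file can also be downloaded directly.

Analyzing vocab.txt shows that the vocabulary has 130344 entries, and that the ratio of Chinese to English entries stays at roughly 1:1.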
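To check the Chinese-to-English ratio yourself, something like the following rough sketch could be run over the dumped file. The classification rule is my own assumption (a piece counts as Chinese if it contains any character in the CJK Unified Ideographs range U+4E00 to U+9FFF, and as English if it consists only of ASCII letters), and the helper names `classify_piece`, `count_scripts`, and `load_pieces` are hypothetical, so the numbers it produces depend on that rule.

```python
def classify_piece(piece: str) -> str:
    """Roughly classify a vocabulary piece as 'chinese', 'english', or 'other'."""
    text = piece.replace("\u2581", "")  # drop the SentencePiece word-start marker
    if any("\u4e00" <= ch <= "\u9fff" for ch in text):
        return "chinese"
    if text and all(ch.isascii() and ch.isalpha() for ch in text):
        return "english"
    return "other"


def count_scripts(pieces):
    """Count how many pieces fall into each rough category."""
    counts = {"chinese": 0, "english": 0, "other": 0}
    for piece in pieces:
        counts[classify_piece(piece)] += 1
    return counts


def load_pieces(path="vocab.txt"):
    """Read the 'id<TAB>piece' lines written by the dump script."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n").split("\t", 1)[1] for line in f if "\t" in line]


# Small in-memory sample so the sketch runs without the model files
sample = ["\u2581hello", "world", "苹果", "\u2581我是", "<n>", "123"]
print(count_scripts(sample))
```

With the real file, `count_scripts(load_pieces("vocab.txt"))` would give the counts for the whole vocabulary.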

2. Special tokens

The special tokens used by the model are listed below:

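Special token	token_id	Description
<n>	4	Newline
▁	5	Connector; marks the start of a word
[gMASK]	130001	Mask used for generating the continuation
<sop>	130004	Start of the output
<eop>	130005	End of the output
<|tab|>	130008	Tab
<|blank_{length}|>	130009-130087	A run of n consecutive spaces (2 ≤ n ≤ 80) becomes one special token; the upper limit is 80, i.e. <|blank_80|>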
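The id range in the last row can be sanity-checked with a little arithmetic. The sketch below assumes the blank tokens occupy consecutive ids in order of length, starting at 130009 for two spaces. That ordering is inferred from the table (79 ids for lengths 2 to 80) and from the sample outputs in this article, not read from the tokenizer source, and the helper names are hypothetical.

```python
BLANK_BASE_ID = 130009           # assumed id of <|blank_2|>, per the table
MIN_BLANK, MAX_BLANK = 2, 80     # shortest and longest run with its own token


def blank_token(length: int) -> str:
    """Return the token text for a run of `length` spaces."""
    if not MIN_BLANK <= length <= MAX_BLANK:
        raise ValueError(f"no blank token for a run of {length} spaces")
    return f"<|blank_{length}|>"


def blank_token_id(length: int) -> int:
    """Return the assumed token id, taking ids to be consecutive by length."""
    blank_token(length)  # reuse the range check
    return BLANK_BASE_ID + (length - MIN_BLANK)


print(blank_token(3), blank_token_id(3))
print(blank_token(80), blank_token_id(80))
```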

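(1) The connector

ChatGLM and LLaMA both tokenize with the SentencePiece library. Each segment returned by SentencePiece's _EncodeAsPiecesBatch method (segments are separated by spaces) starts with a special underscore, ▁, which we will call the connector, because SentencePiece uses it to mark the start of a word. Note that it is not the ordinary underscore, which looks like this: _. The connector is the character U+2581. Marking the start of each word this way helps tell consecutive words apart.

This has two benefits:

a. Word boundary marking: the text SentencePiece processes often has no explicit spaces or other obvious word boundaries (especially in some Asian languages). Using the connector as a word prefix helps the model recognize where words begin.

b. Reversibility: the connector keeps SentencePiece's encoding and decoding reversible. You can rebuild the original text exactly from the encoded subword sequence, including its spaces and word boundaries.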
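The reversibility point can be illustrated with a minimal sketch that only imitates the marker handling, not SentencePiece itself. The function names `mark_boundaries` and `restore_text` are hypothetical, and the sketch assumes the input has no leading space and no runs of several spaces.

```python
WORD_START = "\u2581"  # the connector "▁", not the ASCII underscore "_"


def mark_boundaries(text: str) -> str:
    """Prefix the text with the connector and turn every space into one."""
    return WORD_START + text.replace(" ", WORD_START)


def restore_text(marked: str) -> str:
    """Undo mark_boundaries: connectors become spaces, the leading one is dropped."""
    restored = marked.replace(WORD_START, " ")
    return restored[1:] if restored.startswith(" ") else restored


marked = mark_boundaries("Hello World")
print(marked)                # ▁Hello▁World
print(restore_text(marked))  # Hello World
```

Because the space is stored inside the pieces themselves, no separate bookkeeping is needed to know where the spaces were.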

Here is an interesting example:

from transformers import AutoTokenizer

if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    # Invert the vocabulary so token ids can be mapped back to pieces
    vocab = tokenizer.get_vocab()
    vocab_exchange = {val: key for key, val in vocab.items()}
    # Example 1: the sentence starts with "苹果"
    text1 = "苹果我是昨天买的"
    tokens1 = tokenizer.encode(text1, add_special_tokens=False)
    print("tokens1:", tokens1)
    participles1 = [vocab_exchange[token] for token in tokens1]
    print("participles1:", participles1)
    # Example 2: the same words, but the sentence starts with "我是"
    text2 = "我是昨天买的苹果"
    tokens2 = tokenizer.encode(text2, add_special_tokens=False)
    print("tokens2:", tokens2)
    participles2 = [vocab_exchange[token] for token in tokens2]
    print("participles2:", participles2)

'''
tokens1: [5, 65319, 65806, 67363, 68543]
participles1: ['▁', '苹果', '我是', '昨天', '买的']
tokens2: [71232, 67363, 68543, 65319]
participles2: ['▁我是', '昨天', '买的', '苹果']
'''

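As you can see, the first example matches what we said earlier: a ▁ is automatically added at the start of each segment. In the second example, however, the ▁ has been merged into the first piece. After ▁ is added at the start of the segment, the vocabulary contains a matching entry '▁我是', and since matching prefers the longer piece, the result is the single piece '▁我是' rather than the two pieces '▁' and '我是'.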
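The "longer match wins" behaviour can be imitated with a toy greedy segmenter. This is only an illustration of the idea under a hand-made vocabulary: the real SentencePiece model decides with its trained parameters rather than with this simple loop, and `greedy_longest_match` and `TOY_VOCAB` are hypothetical names.

```python
def greedy_longest_match(text, vocab):
    """Split text left to right, always taking the longest piece found in vocab."""
    pieces = []
    pos = 0
    max_len = max(len(piece) for piece in vocab)
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in vocab:
                pieces.append(candidate)
                pos += size
                break
        else:
            # Unknown character: emit it alone so the loop always advances
            pieces.append(text[pos])
            pos += 1
    return pieces


# Toy vocabulary containing only the pieces seen in the two examples
TOY_VOCAB = {"\u2581", "\u2581我是", "我是", "昨天", "买的", "苹果"}

print(greedy_longest_match("\u2581苹果我是昨天买的", TOY_VOCAB))
print(greedy_longest_match("\u2581我是昨天买的苹果", TOY_VOCAB))
```

In the first call nothing longer than '▁' matches at position 0, so the connector stays on its own; in the second, '▁我是' is in the vocabulary and is taken as a whole.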

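Now for what "each segment" means. Segments are separated by a single space, and the examples below make this clear: every lone space starts a new segment. Note that only a lone space splits segments; multiple consecutive spaces are treated as ordinary whitespace and merged into a <|blank_{length}|> token, as in the third example below: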

from transformers import AutoTokenizer

if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    # Invert the vocabulary so token ids can be mapped back to pieces
    vocab = tokenizer.get_vocab()
    vocab_exchange = {val: key for key, val in vocab.items()}
    # 1: English words separated by a single space
    text1 = "Hello World"
    tokens1 = tokenizer.encode(text1, add_special_tokens=False)
    print("tokens1:", tokens1)
    participles1 = [vocab_exchange[token] for token in tokens1]
    print("participles1:", participles1)
    # 2: Chinese text with a single space, which starts a new segment
    text2 = "我是 昨天买的苹果"
    tokens2 = tokenizer.encode(text2, add_special_tokens=False)
    print("tokens2:", tokens2)
    participles2 = [vocab_exchange[token] for token in tokens2]
    print("participles2:", participles2)
    # 3: two consecutive spaces, which are merged into a blank token
    text3 = "我是  昨天买的苹果"
    tokens3 = tokenizer.encode(text3, add_special_tokens=False)
    print("tokens3:", tokens3)
    participles3 = [vocab_exchange[token] for token in tokens3]
    print("participles3:", participles3)

'''
tokens1: [14833, 398]
participles1: ['▁hello', '▁world']
tokens2: [71232, 70831, 68543, 65319]
participles2: ['▁我是', '▁昨天', '买的', '苹果']
tokens3: [71232, 130009, 67363, 68543, 65319]
participles3: ['▁我是', '<|blank_2|>', '昨天', '买的', '苹果']
'''

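(2) [gMASK]

[gMASK] is the mask used to generate the continuation: it means "start generating from here". During training, the content after [gMASK] is masked out first and then predicted. ChatGLM's attention pattern is the prefix decoder, the second pattern in the figure below. You can think of [gMASK] as separating the input from the output; I'll come back to this when I cover the model structure.

[Figure: comparison of attention patterns; the prefix decoder is the second one]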
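As a rough illustration of the prefix-decoder pattern mentioned above, the sketch below builds an attention mask in which every position can see the whole input (the prefix), while output positions can additionally see only themselves and earlier output positions. Where exactly ChatGLM draws the boundary between input and output is treated here as an assumption, and `prefix_lm_mask` is a hypothetical helper, not a function from the model code.

```python
def prefix_lm_mask(seq_len: int, prefix_len: int):
    """Return a matrix where mask[i][j] == 1 means position i may attend to j."""
    return [
        [1 if (j < prefix_len or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# 3 input positions followed by 2 output positions
for row in prefix_lm_mask(seq_len=5, prefix_len=3):
    print(row)
```

The first three rows are identical because attention inside the input is bidirectional; the last two rows grow one position at a time, like an ordinary causal decoder.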

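(3) <sop> and <eop>

In ChatGLM these two tokens serve as <bos> (Beginning Of Sentence) and <eos> (Ending Of Sentence) respectively, and they are added at the start and end of the output.

Here is an example. The data is one row from a training set; because it is training data, it has an explicit output that serves as the ground truth. This is what data preprocessing looks like before training: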

from transformers import AutoTokenizer, AutoConfig


def preprocess(tokenizer, config, example, max_seq_length):
    prompt = example["context"]
    target = example["target"]
    prompt_ids = tokenizer.encode(prompt, max_length=max_seq_length, truncation=True)
    target_ids = tokenizer.encode(
        target,
        max_length=max_seq_length,
        truncation=True,
        add_special_tokens=False)
    input_ids = prompt_ids + target_ids + [config.eos_token_id]
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}


if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
    max_seq_length = 200
    example = {
        "context": "你是谁",
        "target": "人家是城堡中的小公主"
    }
    token = preprocess(tokenizer, config, example, max_seq_length)
    print("token:", token)

'''
token: {'input_ids': [5, 108293, 130001, 130004, 5, 65870, 63829, 75581, 64102, 103559, 130005], 'seq_len': 4}
'''

The code above converts a question-answer pair into tokens. The data is converted as follows:

[Figure: how the question-answer pair is converted into token ids]
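The same assembly can be traced without loading the model. In this sketch the piece ids of the prompt and the target are copied from the printed result above, the three special ids come from the special-token table, and `assemble_example` is a hypothetical helper that only mirrors the concatenation order.

```python
GMASK_ID = 130001  # [gMASK]
SOP_ID = 130004    # <sop>
EOP_ID = 130005    # <eop>


def assemble_example(prompt_piece_ids, target_piece_ids):
    """Concatenate prompt, [gMASK], <sop>, target, and <eop> in that order."""
    prompt_ids = list(prompt_piece_ids) + [GMASK_ID, SOP_ID]
    input_ids = prompt_ids + list(target_piece_ids) + [EOP_ID]
    # seq_len counts the prompt part, including the two special tokens
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}


example = assemble_example(
    [5, 108293],                                 # pieces of "你是谁"
    [5, 65870, 63829, 75581, 64102, 103559],     # pieces of the target sentence
)
print(example)
```

Here seq_len marks where the prompt ends, so everything after that index belongs to the output that the model should learn to produce.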

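IV. Encoding Process

The Tokenizer uses the sentencepiece package, but quite a lot happens before sentencepiece is called. The example below encodes one row of training data; let's see what happens along the way: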

from transformers import AutoTokenizer, AutoConfig


def preprocess(tokenizer, config, example, max_seq_length):
    prompt = example["context"]
    target = example["target"]
    prompt_ids = tokenizer.encode(prompt, max_length=max_seq_length, truncation=True)
    target_ids = tokenizer.encode(
        target,
        max_length=max_seq_length,
        truncation=True,
        add_special_tokens=False)
    input_ids = prompt_ids + target_ids + [config.eos_token_id]
    return {"input_ids": input_ids, "seq_len": len(prompt_ids)}


if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
    max_seq_length = 200
    example = {
        "context": "你要干什么",
        "target": "小公主   我们来玩吧\nHAHA\tHAHA"
    }
    token = preprocess(tokenizer, config, example, max_seq_length)
    print("token:", token)

'''
token: {'input_ids': [85117, 72675, 130001, 130004, 5, 103559, 130010, 63869, 111415, 63956, 4, 26650, 130008, 26650, 130005], 'seq_len': 4}
'''

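Unless stated otherwise, the code discussed below is in tokenization_chatglm.py, and the entry point is ChatGLMTokenizer._tokenize().

1. Removing spaces and lowercasing

This is configurable; the options are in tokenizer_config.json: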

...
  "remove_space": false,
  "do_lower_case": true,
...

Because removing spaces would interfere with the <|blank_{length}|> handling below, I only lowercase here. The code is as follows:

    def preprocess_text(self, inputs):
        if self.remove_space:
            outputs = " ".join(inputs.strip().split())
        else:
            outputs = inputs

        if self.do_lower_case:
            outputs = outputs.lower()

        return outputs
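To see what the two switches do, here is a standalone version of the same logic. Passing `remove_space` and `do_lower_case` as function arguments instead of reading them from the tokenizer object is my own simplification for the demo.

```python
def preprocess_text(text: str, remove_space: bool = False, do_lower_case: bool = True) -> str:
    """Optionally collapse whitespace, then optionally lowercase."""
    if remove_space:
        # strip() trims both ends, split()/join collapses inner runs to one space
        text = " ".join(text.strip().split())
    if do_lower_case:
        text = text.lower()
    return text


print(repr(preprocess_text("  Hello   World ")))
print(repr(preprocess_text("  Hello   World ", remove_space=True)))
```

With the configuration shown above (remove_space false, do_lower_case true) the spaces survive untouched, which is what lets the later step turn them into blank tokens.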

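2. Converting newlines, tabs, and spaces

\n is replaced with <n>, \t is replaced with <|tab|>, and each run of two or more consecutive spaces is replaced with <|blank_{length}|>, where {length} is the number of spaces, up to 80. Note that although 80 is a parameter, it can only be set to 80 or less, because the vocabulary has no token beyond <|blank_80|>.

The code is as follows: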

    @staticmethod
    def _encode_whitespaces(text: str, max_len: int = 80):
        # Replace tabs
        text = text.replace("\t", SPTokenizer.get_tab_token())
        # Replace runs of spaces, longest first, down to runs of 2
        for i in range(max_len, 1, -1):
            text = text.replace(" " * i, SPTokenizer.get_blank_token(i))
        return text

    def _preprocess(self, text: str, linebreak=True, whitespaces=True):
        if linebreak:
            # Replace newlines
            text = text.replace("\n", "<n>")
        if whitespaces:
            text = self._encode_whitespaces(text, max_len=self.max_blank_length)
        return text
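The effect is easiest to see on a concrete string. The sketch below is a standalone imitation that hard-codes the token spellings from the special-token table instead of calling SPTokenizer.get_tab_token() and SPTokenizer.get_blank_token(), so treat those literal strings as assumptions.

```python
def encode_whitespaces(text: str, max_len: int = 80) -> str:
    """Replace tabs and runs of 2..max_len spaces with their special tokens."""
    text = text.replace("\t", "<|tab|>")
    # Longest runs first, so 3 spaces become one <|blank_3|>, not <|blank_2|> plus a space
    for length in range(max_len, 1, -1):
        text = text.replace(" " * length, f"<|blank_{length}|>")
    return text


def preprocess(text: str, linebreak: bool = True, whitespaces: bool = True) -> str:
    """Replace newlines first, then tabs and runs of spaces."""
    if linebreak:
        text = text.replace("\n", "<n>")
    if whitespaces:
        text = encode_whitespaces(text)
    return text


print(preprocess("小公主   我们来玩吧\nHAHA\tHAHA"))
print(preprocess("a b"))
```

A single space is left alone, which matches the earlier observation that a lone space acts as a segment separator rather than becoming a blank token.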

3. Virtual space

A virtual space can be added at the beginning of the text; it is actually <n>. By default this virtual space is not added.

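4. Generating token_id

After the processing above, sentencepiece's EncodeAsIds() method is called to produce the tokens; this is when the special underscore is attached. sentencepiece is worth studying in its own right. ice_text.model was trained with it too, and judging from the vocabulary it uses the BPE (Byte Pair Encoding) algorithm.

5. Appending special tokens

130001 ([gMASK]) and 130004 (<sop>) are appended to the encoded tokens. Note that when preparing data, the output is followed not by these two tokens but by 130005 (<eop>), and we have to add that ourselves. The code is as follows: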

    def build_inputs_with_special_tokens(
            self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens. A BERT sequence has the following format:

        - single sequence: `[CLS] X [SEP]`
        - pair of sequences: `[CLS] A [SEP] B [SEP]`

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        gmask_id = self.sp_tokenizer[self.gmask_token]
        eos_id = self.sp_tokenizer[self.eos_token]
        token_ids_0 = token_ids_0 + [gmask_id, self.sp_tokenizer[self.bos_token]]
        if token_ids_1 is not None:
            token_ids_0 = token_ids_0 + token_ids_1 + [eos_id]
        return token_ids_0

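The concatenation is triggered from tokenization_utils_base.py in the transformers package, which calls build_inputs_with_special_tokens(). The base-class version, PreTrainedTokenizerBase.build_inputs_with_special_tokens(), is shown below and adds no special tokens of its own; it is the ChatGLMTokenizer override above that appends the special tokens to the end of the tokens: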

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
        adding special tokens.

        This implementation does not add special tokens and this method should be overridden in a subclass.

        Args:
            token_ids_0 (`List[int]`): The first tokenized sequence.
            token_ids_1 (`List[int]`, *optional*): The second tokenized sequence.

        Returns:
            `List[int]`: The model input with special tokens.
        """
        if token_ids_1 is None:
            return token_ids_0
        return token_ids_0 + token_ids_1
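How the two versions of the method interact can be shown with a pair of toy classes. Both classes and the `encode_ids` method are hypothetical stand-ins written for this note, not the transformers or ChatGLM classes; the only point is that the caller lives in the base class while the subclass decides which special tokens get appended.

```python
GMASK_ID, SOP_ID, EOP_ID = 130001, 130004, 130005


class BaseTokenizerSketch:
    """Stand-in for the base class: its hook adds no special tokens."""

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        if token_ids_1 is None:
            return token_ids_0
        return token_ids_0 + token_ids_1

    def encode_ids(self, token_ids, add_special_tokens=True):
        # The base class calls the hook; subclasses decide what it appends
        if add_special_tokens:
            return self.build_inputs_with_special_tokens(token_ids)
        return token_ids


class ChatGLMLikeTokenizerSketch(BaseTokenizerSketch):
    """Stand-in for the subclass: the override appends [gMASK] and <sop>."""

    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
        token_ids_0 = token_ids_0 + [GMASK_ID, SOP_ID]
        if token_ids_1 is not None:
            token_ids_0 = token_ids_0 + token_ids_1 + [EOP_ID]
        return token_ids_0


print(BaseTokenizerSketch().encode_ids([5, 108293]))
print(ChatGLMLikeTokenizerSketch().encode_ids([5, 108293]))
print(ChatGLMLikeTokenizerSketch().encode_ids([5, 108293], add_special_tokens=False))
```

This also shows why add_special_tokens=False in the earlier examples yields the bare pieces: the hook is simply never called.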

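Below is a diagram of the complete encoding process. Some steps have been rearranged slightly to make it easier to follow:

[Figure: the complete encoding process]

V. Decoding Process

Finally, let's look at decode. The process is simple enough to sum up in one sentence: each token_id is converted back to a string according to the vocabulary, and the connector is removed along the way: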

from transformers import AutoTokenizer

if __name__ == "__main__":
    model_name = "THUDM/chatglm-6b"
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    # Invert the vocabulary so token ids can be mapped back to pieces
    vocab = tokenizer.get_vocab()
    vocab_exchange = {val: key for key, val in vocab.items()}
    tokens = [5, 19316, 932]
    # Look up each piece by hand to see the connectors
    participles = [vocab_exchange[token] for token in tokens]
    print("participles:", participles)
    # decode() joins the pieces and removes the connectors
    decode_tokens = tokenizer.decode(tokens)
    print("decode_tokens:", decode_tokens)

'''
participles: ['▁', '▁Hello', '▁World']
decode_tokens: Hello World
'''
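A decode step along these lines can be sketched in a few lines of plain Python. Two things here are assumptions rather than facts taken from the tokenizer source: that the whitespace tokens (<n>, <|tab|>, <|blank_{length}|>) are turned back into real whitespace as the mirror image of encoding, and that all leading spaces can simply be stripped. `decode_pieces` is a hypothetical helper.

```python
import re


def decode_pieces(pieces):
    """Join pieces, turn connectors into spaces, and restore whitespace tokens."""
    text = "".join(pieces).replace("\u2581", " ").lstrip(" ")
    text = text.replace("<n>", "\n").replace("<|tab|>", "\t")
    # <|blank_N|> stands for N consecutive spaces
    return re.sub(r"<\|blank_(\d+)\|>", lambda m: " " * int(m.group(1)), text)


print(decode_pieces(["\u2581", "\u2581Hello", "\u2581World"]))
print(repr(decode_pieces(["\u2581a", "<|blank_3|>", "b", "<n>", "c", "<|tab|>", "d"])))
```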

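One question remains: how was the vocabulary (ice_text.model) generated? ChatGLM and LLaMA both use the BPE implementation in the sentencepiece package. sentencepiece implements four algorithms: BPE (Byte Pair Encoding), Unigram, Word, and Char. What these four algorithms are, and why BPE was chosen in the end, I'll cover separately later, because space is limited (read: I got lazy).

That wraps up the ChatGLM Tokenizer. Follow me so you don't lose your way (#^.^#)...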