How to improve a German text classification model in spaCy

I'm working on a text classification project and I'm using spaCy for it. Right now my accuracy is almost equal to 70%, but that is not enough. I've been trying to improve the model for the past two weeks, but so far without any success. Here I am looking for advice about what I should do or try. Any help would be appreciated!

So, here is what I have done so far:

1) Preparing the data:

I have an imbalanced dataset of German news with 21 categories (such as POLITICS, ECONOMY, SPORT, CELEBRITIES, etc.). To make the classes equal in size, I duplicated the smaller classes (roughly as in the sketch below). As a result, I have 21 files with almost 700 000 lines of text. I then normalize this data with the normalizer function shown after the sketch.
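
A minimal sketch of that duplication step, assuming the raw data is held as (text, label) pairs (the helper name and layout are just an illustration, not my exact script):

import random
from collections import defaultdict

def oversample(examples):
    """Duplicate minority-class examples until every class is as large as the biggest one."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    target = max(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        balanced.extend(random.choices(rows, k=target - len(rows)))  # random duplicates up to the target size
    random.shuffle(balanced)
    return balanced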

import re

import spacy
from charsplit import Splitter

POS = ['NOUN', 'VERB', 'PROPN', 'ADJ', 'NUM']  # allowed parts of speech

nlp_helper = spacy.load('de_core_news_sm')
splitter = Splitter()

# defined elsewhere in my project:
stop_words = set()  # the custom list of stop words that I delete
german = set()      # set of known German words, used to validate compound splits

def normalizer(texts):
    arr = []  # list of normalized texts (will be returned from the function as a result of normalization)

    docs = nlp_helper.pipe(texts)  # creating doc objects for multiple lines
    for doc in docs:  # iterating through each doc object
        text = []  # list of words in normalized text
        for token in doc:  # for each word in text
            word = token.lemma_.lower()

            if word not in stop_words and token.pos_ in POS:  # deleting stop words and some parts of speech
                if len(word) > 8 and token.pos_ == 'NOUN':  # only nouns can be split
                    _, word1, word2 = splitter.split_compound(word)[0]  # checking only the division with the highest prob
                    word1 = word1.lower()
                    word2 = word2.lower()
                    if word1 in german and word2 in german:
                        text.append(word1)
                        text.append(word2)
                    elif word1[:-1] in german and word2 in german:  # word1[:-1] - checking for the 's' that joins two words
                        text.append(word1[:-1])
                        text.append(word2)
                    else:
                        text.append(word)
                else:
                    text.append(word)
        arr.append(re.sub(r'[.,;:?!"()=_+*&^@/\'-]', ' ', ' '.join(text)))  # replace punctuation with spaces; '-' is last in the class so it is matched literally
    return arr
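
A quick usage sketch of the function above (the example sentences are arbitrary):

sample = [
    "Die Bundesregierung plant eine neue Steuerreform.",
    "Der FC Bayern gewinnt das Spitzenspiel gegen Dortmund.",
]
for line in normalizer(sample):
    print(line)  # lemmatized, lower-cased content words, with long compounds split where possible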

Some explanations of the code above:

POS - the list of allowed parts of speech. If the word I'm currently looking at has a part of speech that is not in this list -> I delete it.

stop_words - just a list of words that I delete.

splitter.split_compound(word)[0] - returns the tuple with the most probable split of the compound word (I use it to split long German words into shorter and more commonly used words). Here is the link to the repository with this functionality.
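
For illustration, this is the shape of what split_compound returns as far as I understand the charsplit API (the example word is arbitrary):

from charsplit import Splitter

splitter = Splitter()

# Each candidate is a (score, first_part, second_part) tuple, sorted by score,
# so index [0] is the most probable split.
score, word1, word2 = splitter.split_compound("Autobahnraststätte")[0]
print(word1, word2)  # e.g. Autobahn Raststätte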

To sum up: I take the lemma of each word, lower-case it, delete stop words and some parts of speech, split compound words, and delete punctuation. Then I join all the words and return an array of normalized lines.

2) Training the model

I train my model on top of de_core_news_sm (so that in the future I can use this model not only for classification but also for normalization). Here is the training code:

import spacy
from random import shuffle  # the data is shuffled in place before every epoch

nlp = spacy.load('de_core_news_sm')

textcat = nlp.create_pipe('textcat', config={"exclusive_classes": False, "architecture": 'simple_cnn'})
nlp.add_pipe(textcat, last=True)
for category in categories:
    textcat.add_label(category)

pipe_exceptions = ["textcat"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()

    for i in range(n_iter):
        shuffle(data)
        batches = spacy.util.minibatch(data)

        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.25)

Some explanations of the code above:

data - a list of lists, where each list contains a line of text and a dictionary with the categories (just like in the docs); a minimal example of this format is sketched after this list.

'categories' - the list of categories

'n_iter' - the number of training iterations
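
For reference, a rough sketch of what one entry of data looks like in this non-exclusive textcat setup (the texts are made up, and only three of the 21 labels are shown to keep it short):

data = [
    ("Der FC Bayern gewinnt das Spitzenspiel gegen Dortmund.",
     {"cats": {"SPORT": 1.0, "POLITICS": 0.0, "ECONOMY": 0.0}}),
    ("Die Bundesregierung plant eine neue Steuerreform.",
     {"cats": {"SPORT": 0.0, "POLITICS": 1.0, "ECONOMY": 0.0}}),
]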

3) At the end I just save the model with the to_disk method.
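
Roughly like this (the output directory name is arbitrary):

output_dir = 'textcat_model'
nlp.to_disk(output_dir)  # serialize the whole pipeline, including the textcat component

# later, in another process:
import spacy
nlp = spacy.load(output_dir)
doc = nlp("Der DAX schließt deutlich im Plus.")
print(doc.cats)  # dictionary mapping each category to a score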

With the code above I managed to train a model with 70% accuracy. Here is a list of what I have tried so far to improve that score:

1) Using another architecture (ensemble) - didn't give any improvements

2) Training on non-normalized data - the result was much worse

3) Using a pretrained BERT model - couldn't do it (here is my unanswered question about it)

4) Training de_core_news_md instead of de_core_news_sm - didn't give any improvements (I tried it because, according to the docs, there could be an improvement thanks to the word vectors, if I understood it correctly; correct me if I'm wrong)

5) Training on data, normalized in a slightly different way (without lower casing and punctuation deletion) - didn't give any improvements

6) Changing dropout - didn't help

So right now I'm a bit stuck on what to do next. I would be grateful for any hint or advice.

Thanks in advance for your help!

The first thing I would suggest is increasing the batch size. After that, look at your optimizer (Adam, if possible) and the learning rate, for which I don't see the code here. Finally, you can also try changing the dropout.
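
For the batch size, spaCy 2.x has a compounding schedule you can pass to minibatch; a rough sketch of how the inner loop could look (the concrete numbers are arbitrary, and the learning rate itself lives on the thinc optimizer returned by begin_training, so check the attribute name for your thinc version):

from spacy.util import minibatch, compounding

# let the batch size grow from 4 up to 64 over the course of training
batches = minibatch(data, size=compounding(4.0, 64.0, 1.001))
for batch in batches:
    texts, annotations = zip(*batch)
    nlp.update(texts, annotations, sgd=optimizer, drop=0.2)  # drop is the dropout rate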

Also, if you are experimenting with neural networks and plan to make a lot of changes, it would be better to switch to PyTorch or TensorFlow. With PyTorch you can use the HuggingFace library, which has BERT built in.
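
If you go the HuggingFace route, loading a German BERT for sequence classification looks roughly like this (the checkpoint name is just one of several German BERT models on the Hub, and 21 matches your number of categories):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=21)

inputs = tokenizer("Der FC Bayern gewinnt das Spitzenspiel.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 21]) - one raw score per category before fine-tuning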

Hope this helps!