How to improve a German text classification model in spaCy

I'm working on a text classification project and I'm using spaCy for it. Right now my accuracy is almost equal to 70%, but that is not enough. I've been trying to improve the model for the past two weeks, but so far without any success. Here I am looking for advice about what I should do or try. Any help would be appreciated!

So, here is what I have done so far:

1) Preparing the data:

I have an imbalanced dataset of German news with 21 categories (such as POLITICS, ECONOMY, SPORT, CELEBRITIES, etc.). To make the classes equal in size, I duplicated the smaller classes (roughly as in the sketch below). As a result, I have 21 files with almost 700 000 lines of text. I then normalize this data with the normalizer function shown after the sketch.
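
A minimal sketch of that duplication step, assuming the raw data is held as (text, label) pairs (the helper name and layout are just an illustration, not my exact script):

import random
from collections import defaultdict

def oversample(examples):
    """Duplicate minority-class examples until every class is as large as the biggest one."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    target = max(len(rows) for rows in by_label.values())
    balanced = []
    for rows in by_label.values():
        balanced.extend(rows)
        balanced.extend(random.choices(rows, k=target - len(rows)))  # random duplicates up to the target size
    random.shuffle(balanced)
    return balanced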

import re

import spacy
from charsplit import Splitter

POS = ['NOUN', 'VERB', 'PROPN', 'ADJ', 'NUM']  # allowed parts of speech

nlp_helper = spacy.load('de_core_news_sm')
splitter = Splitter()

# defined elsewhere in my project:
stop_words = set()  # the custom list of stop words that I delete
german = set()      # set of known German words, used to validate compound splits

def normalizer(texts):
    arr = []  # list of normalized texts (will be returned from the function as a result of normalization)

    docs = nlp_helper.pipe(texts)  # creating doc objects for multiple lines
    for doc in docs:  # iterating through each doc object
        text = []  # list of words in normalized text
        for token in doc:  # for each word in text
            word = token.lemma_.lower()

            if word not in stop_words and token.pos_ in POS:  # deleting stop words and some parts of speech
                if len(word) > 8 and token.pos_ == 'NOUN':  # only nouns can be split
                    _, word1, word2 = splitter.split_compound(word)[0]  # checking only the division with the highest prob
                    word1 = word1.lower()
                    word2 = word2.lower()
                    if word1 in german and word2 in german:
                        text.append(word1)
                        text.append(word2)
                    elif word1[:-1] in german and word2 in german:  # word1[:-1] - checking for the 's' that joins two words
                        text.append(word1[:-1])
                        text.append(word2)
                    else:
                        text.append(word)
                else:
                    text.append(word)
        arr.append(re.sub(r'[.,;:?!"()=_+*&^@/\'-]', ' ', ' '.join(text)))  # replace punctuation with spaces; '-' is last in the class so it is matched literally
    return arr
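
A quick usage sketch of the function above (the example sentences are arbitrary):

sample = [
    "Die Bundesregierung plant eine neue Steuerreform.",
    "Der FC Bayern gewinnt das Spitzenspiel gegen Dortmund.",
]
for line in normalizer(sample):
    print(line)  # lemmatized, lower-cased content words, with long compounds split where possible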

Some explanations of the code above:

POS - the list of allowed parts of speech. If the word I'm currently looking at has a part of speech that is not in this list -> I delete it.

stop_words - just a list of words that I delete.

splitter.split_compound(word)[0] - returns the tuple with the most probable split of the compound word (I use it to split long German words into shorter and more commonly used words). Here is the link to the repository with this functionality.
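
For illustration, this is the shape of what split_compound returns as far as I understand the charsplit API (the example word is arbitrary):

from charsplit import Splitter

splitter = Splitter()

# Each candidate is a (score, first_part, second_part) tuple, sorted by score,
# so index [0] is the most probable split.
score, word1, word2 = splitter.split_compound("Autobahnraststätte")[0]
print(word1, word2)  # e.g. Autobahn Raststätte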

To sum up: I take the lemma of each word, lower-case it, delete stop words and some parts of speech, split compound words, and delete punctuation. Then I join all the words and return an array of normalized lines.

2) Training the model

I train my model on top of de_core_news_sm (so that in the future I can use this model not only for classification but also for normalization). Here is the training code:

import spacy
from random import shuffle  # the data is shuffled in place before every epoch

nlp = spacy.load('de_core_news_sm')

textcat = nlp.create_pipe('textcat', config={"exclusive_classes": False, "architecture": 'simple_cnn'})
nlp.add_pipe(textcat, last=True)
for category in categories:
    textcat.add_label(category)

pipe_exceptions = ["textcat"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

with nlp.disable_pipes(*other_pipes):
    optimizer = nlp.begin_training()

    for i in range(n_iter):
        shuffle(data)
        batches = spacy.util.minibatch(data)

        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=0.25)

Some explanations of the code above:

data - a list of lists, where each list contains a line of text and a dictionary with the categories (just like in the docs); a minimal example of this format is sketched after this list.

'categories' - the list of categories

'n_iter' - the number of training iterations
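
For reference, a rough sketch of what one entry of data looks like in this non-exclusive textcat setup (the texts are made up, and only three of the 21 labels are shown to keep it short):

data = [
    ("Der FC Bayern gewinnt das Spitzenspiel gegen Dortmund.",
     {"cats": {"SPORT": 1.0, "POLITICS": 0.0, "ECONOMY": 0.0}}),
    ("Die Bundesregierung plant eine neue Steuerreform.",
     {"cats": {"SPORT": 0.0, "POLITICS": 1.0, "ECONOMY": 0.0}}),
]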

3) At the end I just save the model with the to_disk method.
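
Roughly like this (the output directory name is arbitrary):

output_dir = 'textcat_model'
nlp.to_disk(output_dir)  # serialize the whole pipeline, including the textcat component

# later, in another process:
import spacy
nlp = spacy.load(output_dir)
doc = nlp("Der DAX schließt deutlich im Plus.")
print(doc.cats)  # dictionary mapping each category to a score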

With the code above I managed to train a model with 70% accuracy. Here is a list of what I have tried so far to improve that score:

1) Using another architecture (ensemble) - didn't give any improvements

2) Training on non-normalized data - the result was much worse

3) Using a pretrained BERT model - couldn't do it (here is my unanswered question about it)

4) Training de_core_news_md instead of de_core_news_sm - didn't give any improvements (I tried it because, according to the docs, there could be an improvement thanks to the word vectors, if I understood it correctly; correct me if I'm wrong)

5) Training on data, normalized in a slightly different way (without lower casing and punctuation deletion) - didn't give any improvements

6) Changing dropout - didn't help

So right now I'm a bit stuck on what to do next. I would be grateful for any hint or advice.

Thanks in advance for your help!

The first thing I would suggest is increasing the batch size. After that, look at your optimizer (Adam, if possible) and the learning rate, for which I don't see the code here. Finally, you can also try changing the dropout.
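
For the batch size, spaCy 2.x has a compounding schedule you can pass to minibatch; a rough sketch of how the inner loop could look (the concrete numbers are arbitrary, and the learning rate itself lives on the thinc optimizer returned by begin_training, so check the attribute name for your thinc version):

from spacy.util import minibatch, compounding

# let the batch size grow from 4 up to 64 over the course of training
batches = minibatch(data, size=compounding(4.0, 64.0, 1.001))
for batch in batches:
    texts, annotations = zip(*batch)
    nlp.update(texts, annotations, sgd=optimizer, drop=0.2)  # drop is the dropout rate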

Also, if you are experimenting with neural networks and plan to make a lot of changes, it would be better to switch to PyTorch or TensorFlow. With PyTorch you can use the HuggingFace library, which has BERT built in.
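
If you go the HuggingFace route, loading a German BERT for sequence classification looks roughly like this (the checkpoint name is just one of several German BERT models on the Hub, and 21 matches your number of categories):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=21)

inputs = tokenizer("Der FC Bayern gewinnt das Spitzenspiel.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 21]) - one raw score per category before fine-tuning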

Hope this helps!