Stemming on tokenized words

Given this dataset:

>cleaned['text']
0         [we, have, a, month, open, #postdoc, position,...
1         [the, hardworking, biofuel, producers, in, iow...
2         [the, hardworking, biofuel, producers, in, iow...
3         [in, today, s, time, it, is, imperative, to, r...
4         [special, thanks, to, gaetanos, beach, club, o...
                                ...                        
130736    [demand, gw, sources, fossil, fuels, renewable...
130737         [there, s, just, not, enough, to, go, round]
130738    [the, answer, to, deforestation, lies, in, space]
130739    [d, filament, from, plastic, waste, regrind, o...
130740          [gb, grid, is, generating, gw, out, of, gw]
Name: text, Length: 130741, dtype: object

Is there a simple way to stem all the words?

You may well find a better answer, but personally I think the LemmInflect library works best for lemmatization and inflection.

#!pip install lemminflect
from lemminflect import getAllLemmas, getInflection

word = 'testing'
# getAllLemmas returns a dict of {upos: (lemma, ...)}
lemma = list(getAllLemmas(word, upos='NOUN').values())[0]
inflect = getInflection(lemma[0], tag='VBD')

print(word, lemma, inflect)
testing ('test',) ('tested',)

I would avoid stemming, since it is not very useful if you want to use language models or do text classification in any context. Both stemming and lemmatization produce the root form of an inflected word. The difference is that a stem may not be an actual word, whereas a lemma is an actual word of the language.
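To make that difference concrete, here is a toy suffix-stripping stemmer (illustration only; real stemmers such as Porter's use much more careful rules). Crude suffix removal is exactly why a stem may not be a real word:

```python
# Toy stemmer: blindly strip common suffixes.
# NOT a real stemming algorithm, just an illustration.
def naive_stem(word):
    for suffix in ('ies', 'ing', 'ed', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for w in ['ponies', 'testing', 'producers']:
    print(w, '->', naive_stem(w))
# ponies -> pon        (not a real word)
# testing -> test
# producers -> producer
```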

Inflection is the opposite of lemmatization.


sentence = ['I', 'am', 'testing', 'my', 'new', 'library']

def lemmatize(sentence):
    lemmatized_sent = []
    for word in sentence:
        try:
            # Take the first lemma returned for the word
            lemmatized_sent.append(list(getAllLemmas(word, upos='NOUN').values())[0][0])
        except IndexError:
            # No lemma found -- keep the word unchanged
            lemmatized_sent.append(word)
    return lemmatized_sent

lemmatize(sentence)
['I', 'be', 'test', 'my', 'new', 'library']

# To apply it to a dataframe column:
df['sentences'].apply(lemmatize)
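As a minimal, self-contained sketch of that last step on a token-list column like `cleaned['text']` (the lemma table below is a toy stand-in for the `getAllLemmas` lookup, so this runs even without lemminflect installed):

```python
import pandas as pd

# Toy lemma table standing in for lemminflect's getAllLemmas lookup.
LEMMAS = {'am': 'be', 'testing': 'test', 'producers': 'producer'}

def lemmatize_tokens(tokens):
    # Map each token to its lemma, keeping it unchanged when none is found.
    return [LEMMAS.get(t, t) for t in tokens]

cleaned = pd.DataFrame({'text': [['we', 'am', 'testing'],
                                 ['biofuel', 'producers']]})
cleaned['text'] = cleaned['text'].apply(lemmatize_tokens)
print(cleaned['text'].tolist())
# [['we', 'be', 'test'], ['biofuel', 'producer']]
```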

Be sure to read the LemmInflect documentation; you can do much more with it.