How to pass part-of-speech in WordNetLemmatizer?

I am preprocessing text data, but I am running into trouble with lemmatization. Here is some sample text:

'An 18-year-old boy was referred to prosecutors Thursday for allegedly stealing about ¥15 million (4,300) worth of cryptocurrency last year by hacking a digital currency storage website, police said.', 'The case is the first in Japan in which criminal charges have been pursued against a hacker over cryptocurrency losses, the police said.', '\n', 'The boy, from the city of Utsunomiya, Tochigi Prefecture, whose name is being withheld because he is a minor, allegedly stole the money after hacking Monappy, a website where users can keep the virtual currency monacoin, between Aug. 14 and Sept. 1 last year.', 'He used software called Tor that makes it difficult to identify who is accessing the system, but the police identified him by analyzing communication records left on the website’s server.', 'The police said the boy has admitted to the allegations, quoting him as saying, “I felt like I’d found a trick no one knows and did it as if I were playing a video game.”', 'He took advantage of a weakness in a feature of the website that enables a user to transfer the currency to another user, knowing that the system would malfunction if transfers were repeated over a short period of time.', 'He repeatedly submitted currency transfer requests to himself, overwhelming the system and allowing him to register more money in his account.', 'About 7,700 users were affected and the operator will compensate them.', 'The boy later put the stolen monacoins in an account set up by a different cryptocurrency operator, received payouts in a different cryptocurrency and bought items such as a smartphone, the police said.', 'According to the operator of Monappy, the stolen monacoins were kept using a system with an always-on internet connection, and those kept offline were not stolen.'

My code is:

import pandas as pd
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Required NLTK data (no-ops if already downloaded)
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

df = pd.read_csv('All Articles.csv')
df['Articles'] = df['Articles'].str.lower()

stemming = PorterStemmer()
stops = set(stopwords.words('english'))
lemma = WordNetLemmatizer()

def identify_tokens(row):
    Articles = row['Articles']
    tokens = nltk.word_tokenize(Articles)
    token_words = [w for w in tokens if w.isalpha()]
    return token_words


df['words'] = df.apply(identify_tokens, axis=1)


def stem_list(row):
    my_list = row['words']
    stemmed_list = [stemming.stem(word) for word in my_list]
    return stemmed_list


df['stemmed_words'] = df.apply(stem_list, axis=1)


def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return lemma_list


df['lemma_words'] = df.apply(lemma_list, axis=1)


def remove_stops(row):
    my_list = row['lemma_words']
    meaningful_words = [w for w in my_list if w not in stops]
    return meaningful_words


df['stem_meaningful'] = df.apply(remove_stops, axis=1)


def rejoin_words(row):
    my_list = row['stem_meaningful']
    joined_words = " ".join(my_list)
    return joined_words


df['processed'] = df.apply(rejoin_words, axis=1)

As is clear from the code, I am using pandas; the sample text above is just for illustration.

My problem area is:

def lemma_list(row):
    my_list = row['stemmed_words']
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return lemma_list

df['lemma_words'] = df.apply(lemma_list, axis=1)

Although the code runs without any errors, the lemmatization function is not working as expected.

Thanks in advance.

In your code above, you are trying to lemmatize words that have already been stemmed. When the lemmatizer runs into a word it does not recognize, it simply returns that word unchanged. For example, stemming offline produces offlin, and when you run offlin through the lemmatizer it just gives back the same word, offlin.

Your code should be modified to lemmatize the original words, like this...

def lemma_list(row):
    my_list = row['words']  # Note: line that is changed
    lemma_list = [lemma.lemmatize(word, pos='v') for word in my_list]
    return lemma_list

df['lemma_words'] = df.apply(lemma_list, axis=1)
print('Words: ',  df.loc[0, 'words'])
print('Stems: ',  df.loc[0, 'stemmed_words'])
print('Lemmas: ', df.loc[0, 'lemma_words'])

This produces...

Words:  ['and', 'those', 'kept', 'offline', 'were', 'not', 'stolen']
Stems:  ['and', 'those', 'kept', 'offlin',  'were', 'not', 'stolen']
Lemmas: ['and', 'those', 'keep', 'offline', 'be',   'not', 'steal']

which is correct.