How to do lemmatization using NLTK or pywsd

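I know my explanation is long, but I felt it was necessary. Hopefully someone is patient and helpful :) I am working on a sentiment analysis project at the moment and I am stuck in the preprocessing part. I imported the csv file, turned it into a dataframe, and converted the variables/columns to the right data types. Then I did the tokenization like this, selecting the variable I want to tokenize (the tweet content) in the dataframe (df_tweet1):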

# Tokenization
from nltk.tokenize import TweetTokenizer

tknzr = TweetTokenizer()
tokenized_sents = [tknzr.tokenize(str(i)) for i in df_tweet1['Tweet Content']]
for i in tokenized_sents:
    print(i)

The output is a list of lists of words (tokens).

Then I perform stop word removal:

# Stop word removal
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
#add words that aren't in the NLTK stopwords list
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

clean_sents = []
for m in tokenized_sents:
    stop_m = [i for i in m if str(i).lower() not in new_stopwords_list]
    clean_sents.append(stop_m)

The output is the same, but without the stop words.

The next two steps are what confuse me (part-of-speech tagging and lemmatization). I tried two things:

1) Convert the previous output into a list of strings

new_test = [' '.join(x) for x in clean_sents]

because I thought this would let me do both steps in one with this code:

from pywsd.utils import lemmatize_sentence

text = new_test
lemm_text = lemmatize_sentence(text, keepWordPOS=True)

I got this error: TypeError: expected string or bytes-like object

2) Perform POS tagging and lemmatization separately. First the POS tagging, using clean_sents as input:

# PART-OF-SPEECH
import nltk

def process_content(clean_sents):
    try:
        tagged_list = []
        for lst in clean_sents[:500]:
            for item in lst:
                words = nltk.word_tokenize(item)
                tagged = nltk.pos_tag(words)
                tagged_list.append(tagged)
        return tagged_list

    except Exception as e:
        print(str(e))

output_POS_clean_sents = process_content(clean_sents)

The output is a list of lists of words with a tag. Then I want to lemmatize this output, but how? I tried two modules, but both gave me errors:

from pywsd.utils import lemmatize_sentence

lemmatized= [[lemmatize_sentence(output_POS_clean_sents) for word in s]
              for s in output_POS_clean_sents]

# AND

from nltk.stem.wordnet import WordNetLemmatizer

lmtzr = WordNetLemmatizer()
lemmatized = [[lmtzr.lemmatize(word) for word in s]
              for s in output_POS_clean_sents]
print(lemmatized)

The errors are, respectively:

TypeError: expected string or bytes-like object

AttributeError: 'tuple' object has no attribute 'endswith'

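In the first part, new_test is a list of strings. lemmatize_sentence expects a single string, so passing new_test raises an error like the one you got. You have to pass each string separately and then build a list from the lemmatized results. So: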

text = new_test
lemm_text = [lemmatize_sentence(sentence, keepWordPOS=True) for sentence in text]

should create a list of lemmatized sentences.
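To see why the list fails, note that this TypeError is the message Python's `re` module raises when it is handed something that is not a string, which is most likely what happens inside the tokenizer. Here is a minimal sketch of that, using a hypothetical stand-in `fake_lemmatize_sentence` (not pywsd's implementation) that, like `lemmatize_sentence`, expects one string at a time:

```python
import re

def fake_lemmatize_sentence(sentence):
    """Hypothetical stand-in for a string-only API: returns lowercased word tokens."""
    # re.findall raises TypeError when `sentence` is not a str or bytes-like object
    return [tok.lower() for tok in re.findall(r"\w+", sentence)]

new_test = ["Cats are running", "Dogs barked loudly"]

# Passing the whole list reproduces the kind of error from the question
try:
    fake_lemmatize_sentence(new_test)
    error = None
except TypeError as exc:
    error = exc
print(type(error).__name__)

# Passing one string at a time works
per_sentence = [fake_lemmatize_sentence(s) for s in new_test]
print(per_sentence)
```

The same one-string-at-a-time pattern is what the list comprehension around `lemmatize_sentence` does.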

I actually once did a project that looks similar to what you are doing. I wrote the following functions to lemmatize strings:

import lemmy, re

def remove_stopwords(lst):
    with open('stopwords.txt', 'r') as sw:
        #read the stopwords file (one stopword per line)
        stopwords = set(sw.read().split('\n'))
        return [word for word in lst if word not in stopwords]

def lemmatize_strings(body_text, language = 'da', remove_stopwords_ = True):
    """Function to lemmatize a string or a list of strings, i.e. reduce each word to its dictionary form (lemma). Also removes punctuation.

    -- body_text: string or list of strings
    -- language: language of the passed string(s), e.g. 'en', 'da' etc.
    """

    if isinstance(body_text, str):
        body_text = [body_text] #Convert whatever passed to a list to support passing of single string

    if not hasattr(body_text, '__iter__'):
        raise TypeError('Passed argument should be a sequence.')

    lemmatizer = lemmy.load(language) #load lemmatizing dictionary

    lemma_list = [] #list to store each lemmatized string 

    word_regex = re.compile('[a-zA-Z0-9æøåÆØÅ]+') #Letters (including Danish æ, ø, å) and digits, i.e. all possible words

    for string in body_text:
        #remove punctuation and split words
        matches = word_regex.findall(string)

        #split words and lowercase them unless they are all caps
        lemmatized_string = [word.lower() if not word.isupper() else word for word in matches]

        #remove words that are in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        #lemmatize each word and choose the shortest word of suggested lemmatizations
        lemmatized_string = [min(lemmatizer.lemmatize('', word), key=len) for word in lemmatized_string]

        #remove stopwords again, since a lemma may itself be in the stopwords file
        if remove_stopwords_:
            lemmatized_string = remove_stopwords(lemmatized_string)

        lemma_list.append(' '.join(lemmatized_string))

    return lemma_list if len(lemma_list) > 1 else lemma_list[0] #return a list if more than one string was processed, else the single string

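You can take a look at that if you like, but don't feel obliged. I would be more than glad if it helps you get any ideas; I spent a lot of time trying to figure it out myself!

Let me know :-)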

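If you are using a dataframe, I suggest storing the results of the preprocessing steps in a new column. That way you can always check the output, and you can always create a list of lists to use as input for a model later in the code. Another advantage of this approach is that you can easily visualise the preprocessing pipeline and add other steps wherever you need them without getting confused.

Regarding your code, it can be optimised (for example, you could perform stop word removal and tokenisation at the same time), and I find the steps you perform a bit confusing. For example, you perform lemmatisation multiple times, using different libraries as well, and there is no point in doing that. In my opinion nltk works just fine; personally I use other libraries to preprocess tweets only to deal with emojis, urls and hashtags, all stuff specifically related to tweets.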

# I won't write all the imports, you get them from your code
# define new column to store the processed tweets
# (object dtype, so that each cell can hold a list)
df_tweet1['Tweet Content Clean'] = pd.Series(index=df_tweet1.index, dtype=object)

tknzr = TweetTokenizer()
lmtzr = WordNetLemmatizer()

stop_words = set(stopwords.words("english"))
new_stopwords = ['!', ',', ':', '&', '%', '.', '’']
new_stopwords_list = stop_words.union(new_stopwords)

# iterate through each tweet
for ind, row in df_tweet1.iterrows():

    # get initial tweet: 'This is the initial tweet'
    tweet = str(row['Tweet Content'])

    # tokenisation, stopwords removal and lemmatisation all at once
    # out: ['initial', 'tweet']
    tweet = [lmtzr.lemmatize(i) for i in tknzr.tokenize(tweet) if i.lower() not in new_stopwords_list]

    # pos tag, no need to lemmatise again after.
    # out: [('initial', 'JJ'), ('tweet', 'NN')]
    tweet = nltk.pos_tag(tweet)

    # save processed tweet into the new column
    # (.at sets a single cell, so the whole list is stored as one value)
    df_tweet1.at[ind, 'Tweet Content Clean'] = tweet

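In total you only need 4 lines: one to get the tweet string, two to preprocess the text, and another to store the tweet. You can add extra processing steps, paying attention to the output of each step (e.g. tokenisation returns a list of strings, pos tagging returns a list of tuples, which is the reason you ran into trouble).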
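Since POS tagging returns (word, tag) tuples, a lemmatizer that expects a plain string has to be given the word only, and the tag can optionally be passed along as a hint. The sketch below is a variation rather than the approach shown here, because it tags first and lemmatizes afterwards. The helper names `penn_to_wordnet` and `lemmatize_tagged` are made up for illustration, and `toy_lemmatize` is a tiny stand-in so the example runs without NLTK data:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (the kind nltk.pos_tag returns) to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun

def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs; `lemmatize` is any callable taking (word, pos)."""
    # Unpack each tuple so the lemmatizer receives a string, not a tuple
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

def toy_lemmatize(word, pos):
    """Toy lookup standing in for a real lemmatizer (assumption, for illustration only)."""
    table = {("running", "v"): "run", ("cats", "n"): "cat"}
    return table.get((word, pos), word)

tagged = [("cats", "NNS"), ("running", "VBG"), ("fast", "RB")]
result = lemmatize_tagged(tagged, toy_lemmatize)
print(result)  # ['cat', 'run', 'fast']
```

With NLTK and its WordNet data installed, passing `WordNetLemmatizer().lemmatize` in place of `toy_lemmatize` should work the same way, since that method accepts the word followed by a POS letter.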

If you want, you can then create a list of lists containing all the tweets in the dataframe:

# out: [[('initial', 'JJ'), ('tweet', 'NN')], [second tweet], [third tweet]]
all_tweets = [tweet for tweet in df_tweet1['Tweet Content Clean']]
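On the tweet-specific cleanup mentioned above (urls, hashtags and the like), here is a rough standard-library sketch of the kind of step that could run before tokenisation. It is not one of the dedicated libraries referred to. The patterns are simplified assumptions, `clean_tweet` is a hypothetical helper name, and emojis are not handled at all:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")  # simplified URL pattern (assumption)
MENTION_RE = re.compile(r"@\w+")               # @username mentions
HASHTAG_RE = re.compile(r"#(\w+)")             # captures the word after '#'

def clean_tweet(text):
    """Strip URLs and mentions, keep hashtag words without the '#', tidy whitespace."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = HASHTAG_RE.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_tweet("Loving the new phone! https://t.co/abc123 @store #happy")
print(cleaned)  # Loving the new phone! happy
```

Whether to drop mentions or keep the hashtag words is a design choice that depends on the sentiment model, so treat these defaults as placeholders.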